ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: crash with CUDA graphs on A100 #9727

Closed (ggerganov closed this issue 1 week ago)

ggerganov commented 1 week ago

What happened?

I am currently running some tests on an A100, and llama.cpp crashes when CUDA graphs are enabled. Repro steps:

cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80 ..
make -j

./bin/llama-cli -m ../models/llama-3.1-8b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n -1 -s 2 -ngl 99 -c 4096

llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1002.00 MiB
llm_load_tensors:      CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 124

system_info: n_threads = 124 (n_threads_batch = 124) / 124 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

sampler seed: 1
sampler params: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

I believe the meaning of life is the way we act in which we perceive our personality and the way of life experience and how many. You need to make this.
How to the one you the most of the way and the one and this one the one the other this in the other this in this is this.
The
The one in this
the this in this and this.
The
and this in this in this is this is the one the oneThe is this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in/home/ggerganov/development/llama.cpp/ggml/src/ggml-cuda.cu:106: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/ggerganov/development/llama.cpp/ggml/src/ggml-cuda.cu:2476
  cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 3036247]
[New LWP 3036251]
[New LWP 3036303]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0015b87c7f in __GI___wait4 (pid=3036530, stat_loc=0x7ffe83b62044, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27  ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f0015b87c7f in __GI___wait4 (pid=3036530, stat_loc=0x7ffe83b62044, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27  in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f0016020a73 in ggml_abort () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#2  0x00007f0016105876 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#3  0x00007f0016106d5b in ggml_backend_cuda_synchronize(ggml_backend*) () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#4  0x00007f001606b60d in ggml_backend_sched_synchronize () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#5  0x00007f001cf29534 in llama_synchronize () from /home/ggerganov/development/llama.cpp/build-cuda-2/src/libllama.so
#6  0x00007f001cf2d539 in llama_get_logits_ith () from /home/ggerganov/development/llama.cpp/build-cuda-2/src/libllama.so
#7  0x000055d200455ef8 in gpt_sampler_sample(gpt_sampler*, llama_context*, int, bool) ()
#8  0x000055d2003db119 in main ()
[Inferior 1 (process 3036246) detached]
Aborted (core dumped)

Build info:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
build: 3870 (841713e1) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   54C    P0   100W / 400W |   8337MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The model is F16 LLaMA 3.1. The command crashes consistently, though at a different point after the start each time.

It stops crashing if I set GGML_CUDA_DISABLE_GRAPHS=1.
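For reference, a minimal way to apply that workaround for a single run (assuming a bash-style shell; the command is otherwise identical to the repro above):

GGML_CUDA_DISABLE_GRAPHS=1 ./bin/llama-cli -m ../models/llama-3.1-8b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n -1 -s 2 -ngl 99 -c 4096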

@agray3 Do you have any ideas about what the issue might be? Do you observe the same crash on an A100?

Name and Version

version: 3870 (841713e1) built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

agray3 commented 1 week ago

I've tried to reproduce this on an A100-SXM using https://huggingface.co/second-state/Meta-Llama-3.1-8B-Instruct-GGUF/blob/f24f6a255a71bd6211dccbc9bb67f83c51b7edab/Meta-Llama-3.1-8B-Instruct-f16.gguf and have not been able to: it runs indefinitely for me without any crash. My A100 is 80GB rather than 40GB, but I'd be surprised if that difference mattered. I'm assuming that model is close enough to what you are testing with; if not, can you please share (or point to) your gguf file? Also, do you have any other A100 cards to test on, to rule out an issue with that specific card? It might also be worth building for an older architecture, e.g. -DCMAKE_CUDA_ARCHITECTURES="75" (which will run via PTX JIT compilation), to check whether the issue is related to building with 80.
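A rough sketch of that last suggestion, reusing the build steps from the original repro with only the architecture flag changed (using a fresh build directory to be safe, so no cached CUDA architecture setting carries over):

mkdir build-sm75 && cd build-sm75
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75" ..
make -j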

ggerganov commented 1 week ago

Thanks, I think I've made a mistake when installing CUDA. Sorry for the noise. I will report back if there is still an issue after I reinstall. Closing for now.