ggerganov / llama.cpp

LLM inference in C/C++

Multi-GPU has been broken for me recently. ggml-cuda.cu:7068: invalid argument #3930

Closed: Ph0rk0z closed this issue 1 year ago

Ph0rk0z commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Latest git llama.cpp with Python bindings.

Expected Behavior

Inference works like before.

Current Behavior

Inference fails and llama.cpp crashes.

Environment and Context

python 3.10 / cuda 11.8

Failure Information (for bugs)


llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.26 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =  140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39362.61 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1280.00 MB
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 41210.61 MB (model: 39362.61 MB, context: 1848.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
2023-11-02 17:16:43 INFO:Loaded the model in 37.54 seconds.
Enabled NVLINK P2P 0->1
Enabled NVLINK P2P 1->0

CUDA error 1 at /home/supermicro/ai/llama-cpp-python-gguf-cuda/vendor/llama.cpp/ggml-cuda.cu:7068: invalid argument
current device: 1

Relevant Code

I have some printf's for NVLink, as you can see, so the line numbers are a little off, but here is the snippet that sets it off.


                // copy src0, src1 to device if necessary
                if (src1->backend == GGML_BACKEND_GPU && src1_is_contiguous) {
                    if (id != g_main_device) {
                        if (convert_src1_to_q8_1) {
                            char * src1_ddq_i_source = src1_ddq[g_main_device] + src1_ddq_i_offset;
                            // ****> this is the CUDA_CHECK that sets it off
                            CUDA_CHECK(cudaMemcpyAsync(src1_ddq_i, src1_ddq_i_source, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs,
                                                       cudaMemcpyDeviceToDevice, stream));
                        } else {
                            float * src1_ddf_i_source = (float *) src1_extra->data_device[g_main_device];
                            src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
                            CUDA_CHECK(cudaMemcpyAsync(src1_ddf_i, src1_ddf_i_source, src1_ncols*ne10*sizeof(float),
                                                       cudaMemcpyDeviceToDevice, stream));
                        }
                    }

One of the args to cudaMemcpyAsync is invalid; I haven't checked yet which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but subsequent commits fixed that up. I waited a little to see if that would happen here too since the code is so new, and I can't access GitHub from that machine, so I have to bring the logs over here.

It happens with both P40s and 3090s, and it is independent of whether I force MMQ or not.
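
In case it helps anyone narrow this down, here is a small standalone sketch I put together (my own test code, not llama.cpp's) that shows how a device-to-device cudaMemcpyAsync comes back with error 1 (cudaErrorInvalidValue), and how cudaPointerGetAttributes can be used to check what CUDA thinks of each pointer before the copy:

// standalone repro sketch, not llama.cpp code: shows cudaMemcpyAsync
// returning error 1 (cudaErrorInvalidValue) and how to inspect the
// pointer arguments before the call
#include <cstdio>
#include <cuda_runtime.h>

static void report(cudaError_t err, const char * what) {
    fprintf(stderr, "%s -> %d (%s)\n", what, (int) err, cudaGetErrorString(err));
}

static void inspect(const void * p, const char * name) {
    // reports whether CUDA considers the pointer a device allocation,
    // and on which device it lives
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, p);
    if (err != cudaSuccess) {
        report(err, name);
        return;
    }
    fprintf(stderr, "%s: memory type %d, device %d\n", name, (int) attr.type, attr.device);
}

int main() {
    float * src = nullptr;
    float * dst = nullptr;
    cudaStream_t stream;

    cudaSetDevice(0);
    cudaMalloc(&dst, 1024 * sizeof(float));
    cudaStreamCreate(&stream);

    inspect(src, "src");   // src is still NULL here
    inspect(dst, "dst");

    // a NULL source with a non-zero size is one way to get
    // cudaErrorInvalidValue out of this call
    report(cudaMemcpyAsync(dst, src, 1024 * sizeof(float),
                           cudaMemcpyDeviceToDevice, stream), "cudaMemcpyAsync");

    cudaStreamDestroy(stream);
    cudaFree(dst);
    return 0;
}

With something like this it should be possible to dump the pointers, size, and stream right before the failing CUDA_CHECK and see which argument CUDA is actually rejecting.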

neel-alex commented 1 year ago

I'm encountering the same issue. Llama 2 70B, 8bit quantized. 2x A100. Compiled with:

make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_80

Command:

./main -ngl 83 -m ../transformers_cache/llama-2-70b.Q8_0.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Never gonna give"

Fails with:

CUDA error 1 at ggml-cuda.cu:7044: invalid argument
current device: 1

Whereas setting -ngl 0 and running it entirely on CPU runs fine (if slowly).

young-developer commented 1 year ago

I assume it could be related to my changes for CUDA memory pools. Once #3931 is merged, try recompiling with GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL and double-check.
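
As far as I understand, the non-custom path relies on CUDA's stream-ordered allocator (cudaMallocAsync / cudaFreeAsync and the device memory pools). For anyone who wants to see what that API looks like, here is a rough standalone sketch of it; this is not the actual ggml-cuda.cu code:

// minimal sketch of CUDA's stream-ordered allocator, for context only;
// this is not the ggml-cuda.cu implementation
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define CHECK(call)                                                              \
    do {                                                                         \
        cudaError_t err_ = (call);                                               \
        if (err_ != cudaSuccess) {                                               \
            fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(err_)); \
            return 1;                                                            \
        }                                                                        \
    } while (0)

int main() {
    cudaStream_t stream;
    CHECK(cudaSetDevice(0));
    CHECK(cudaStreamCreate(&stream));

    // allocations come from the device's default memory pool and are
    // ordered on the stream, so buffers can be reused across iterations
    // without repeated synchronous cudaMalloc/cudaFree calls
    void * buf = nullptr;
    CHECK(cudaMallocAsync(&buf, 1 << 20, stream));

    // optionally let the pool keep freed memory around instead of
    // returning it to the driver immediately
    cudaMemPool_t pool;
    CHECK(cudaDeviceGetDefaultMemPool(&pool, 0));
    uint64_t threshold = UINT64_MAX;
    CHECK(cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold));

    CHECK(cudaFreeAsync(buf, stream));
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}

Devices/drivers that report cudaDevAttrMemoryPoolsSupported == 0 cannot use this path at all, which I assume is part of why the custom-pool fallback is kept behind GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL.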

ggerganov commented 1 year ago

@Ph0rk0z Can you bisect at which commit the failure occurs?

sgoll commented 1 year ago

@ggerganov I am seeing the same error. git bisect reveals that commit d6069051de7165a4e06662c89257f5d2905bb156 (#3903) seems to be the culprit.

PS: As per https://github.com/ggerganov/llama.cpp/pull/2470#issuecomment-1769068705 I am compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0. But the same error happens without that option.
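
For anyone else poking at the multi-GPU path: I'm not claiming peer access is the culprit, but the LLAMA_CUDA_PEER_MAX_BATCH_SIZE option (and the "Enabled NVLINK P2P" messages in the log above) ultimately comes down to the standard CUDA peer-access calls. A minimal standalone sketch of those checks (my own test code, not llama.cpp's):

// minimal sketch of CUDA peer-to-peer access checks, for context only;
// this is not the llama.cpp implementation
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);

    for (int src = 0; src < n_devices; ++src) {
        for (int dst = 0; dst < n_devices; ++dst) {
            if (src == dst) continue;

            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, src, dst);
            printf("device %d -> device %d: peer access %s\n",
                   src, dst, can_access ? "possible" : "not possible");

            if (can_access) {
                // peer access is enabled from the *current* device, so
                // switch to src before enabling access to dst
                cudaSetDevice(src);
                cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
                if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled) {
                    fprintf(stderr, "enable %d -> %d failed: %s\n",
                            src, dst, cudaGetErrorString(err));
                }
            }
        }
    }
    return 0;
}

Even when peer access is not enabled, device-to-device copies are still supposed to fall back to staging through the host, so a missing P2P link by itself shouldn't, as far as I know, produce an invalid-argument error.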

Ph0rk0z commented 1 year ago

Mine has been broken since: https://github.com/ggerganov/llama.cpp/pull/2268

At first it would crash out on model load the way setting too high an n_batch does, i.e. trying to allocate massive amounts of system RAM. After the memory pool commits it gives the error above. The memory pool PR does not fix it, but it at least avoids the crash.

yourbuddyconner commented 1 year ago

For what it's worth, I am seeing this in a fresh build of llama.cpp as well. I am building via the llama-cpp-python package.

(task, pid=12595) ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
(task, pid=12595) ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
(task, pid=12595) ggml_init_cublas: found 4 CUDA devices:
(task, pid=12595)   Device 0: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 1: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 2: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 3: Tesla T4, compute capability 7.5
...
(task, pid=12595) CUDA error 1 at /tmp/pip-install-bxeyyykh/llama-cpp-python_262979da943c43fa9967b3c0a61f8580/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
(task, pid=12595) current device: 1
moatftw commented 1 year ago

same error with cuda 12.3:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
  Device 0: NVIDIA A10, compute capability 8.6
  Device 1: NVIDIA A10, compute capability 8.6
  Device 2: NVIDIA A10, compute capability 8.6
  Device 3: NVIDIA A10, compute capability 8.6

...
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A10) as main device
llm_load_tensors: mem required = 86.05 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.06 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 487.50 MB
llama_new_context_with_model: kv self size = 487.50 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 282.00 MB
llama_new_context_with_model: VRAM scratch buffer: 275.37 MB
llama_new_context_with_model: total VRAM used: 5569.93 MB (model: 4807.06 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

CUDA error 1 at /tmp/pip-install-5bufkrrh/llama-cpp-python_9a816a9490ba42a78dfd85cdba57cabf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

riley-access-labs commented 1 year ago

Same error here with 2 x T4s using the Python package. It happened to me when redeploying my production Kubernetes environment. I had to quickly downgrade to 1 GPU to get the environment back up. I really do need this fixed ASAP as 1 GPU won't be able to handle load at peak times very well.

young-developer commented 1 year ago

Please test changes from https://github.com/ggerganov/llama.cpp/pull/3931. CUDA pools are optional now.

Ph0rk0z commented 1 year ago

After reverting the CUDA pool changes, it appears to be working again.

RachelShalom commented 1 year ago

I am getting the same error. Should I install a specific version? I installed:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

....................................................................................................
llama_new_context_with_model: n_ctx = 3000
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1500.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 10.02 MB
llama_new_context_with_model: VRAM scratch buffer: 3.40 MB
llama_new_context_with_model: total VRAM used: 3170.43 MB (model: 3167.03 MB, context: 3.40 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
time took to retrive documents is 0.6446716785430908

CUDA error 1 at /tmp/pip-install-1ypw1658/llama-cpp-python_1c1bc0be5c7249408c254fa56f97252b/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

young-developer commented 1 year ago

@RachelShalom Try to retest the latest version.

RachelShalom commented 1 year ago

I installed llama cpp a few hours ago and got this error. I assume I installed the latest, unless the fix mentioned here is not in a release.

ccbadd commented 1 year ago

I installed llama cpp a few hours ago and got this error. I assume I installed the latest, unless the fix mentioned here is not in a release.

Did you install llama.cpp or llama-cpp-python? I really don't know how quickly llama.cpp changes propagate to llama-cpp-python.

RachelShalom commented 1 year ago

Python, using this: CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

and I am using langchain to load the model. I updated langchain and now I have a new error:

CUDA error 222 at /tmp/pip-install-qcfy69x9/llama-cpp-python_d60a2a3fe09943d5b39a16dab77b98a7/vendor/llama.cpp/ggml-cuda.cu:7043: the provided PTX was compiled with an unsupported toolchain.
current device: 0

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

davidleo1984 commented 1 year ago

I used llama-cpp-python with langchain and got the same error. I installed:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

I also upgraded langchain to 0.0.330.

Here is the output:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1

...

llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llm_load_tensors: mem required = 172.97 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/35 layers to GPU
llm_load_tensors: VRAM used: 3718.38 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 7.18 MB
llama_new_context_with_model: VRAM scratch buffer: 0.55 MB
llama_new_context_with_model: total VRAM used: 3718.93 MB (model: 3718.38 MB, context: 0.55 MB)

CUDA error 1 at /tmp/pip-install-2o911nrr/llama-cpp-python_7b2f2508c89b451280d9116461f3c9cf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

I have two different cards and they worked well with the compiled llama.cpp, but I got this error when I tried llama-cpp-python. :(

Ph0rk0z commented 1 year ago

I'm using llama-cpp-python too, and I just git pull llama.cpp instead of using the cherry-picked revision it pins. Sometimes that's good and sometimes that's bad.

jezzarax commented 1 year ago

Same issue for me on a 2x A100 80GB PCIe setup with https://github.com/ggerganov/llama.cpp/pull/3586. Running with CUDA_VISIBLE_DEVICES=1 works for models that fit. Building with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 doesn't help. My setup works on #3901. I will try to see if I can find the commit that breaks it (e.g. #3903, as suspected in this thread).