Ph0rk0z closed this issue 1 year ago
I'm encountering the same issue. Llama 2 70B, 8bit quantized. 2x A100. Compiled with:
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_80
Command:
./main -ngl 83 -m ../transformers_cache/llama-2-70b.Q8_0.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Never gonna give"
Fails with:
CUDA error 1 at ggml-cuda.cu:7044: invalid argument
current device: 1
Whereas setting -ngl 0 and running it entirely on CPU runs fine (if slowly).
I assume it could be related to my changes for CUDA memory pools. Once #3931 is merged, try recompiling with GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL and double-check.
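If you want to try that once #3931 lands, one possible way to pass the define is through a CMake build; this is only a sketch and assumes the define is picked up via CMAKE_CUDA_FLAGS (the same pattern the pip installs later in this thread use):
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_FLAGS="-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL"
cmake --build . --config Release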
@Ph0rk0z Can you bisect at which commit the failure occurs?
@ggerganov I am seeing the same error. git bisect reveals that commit d6069051de7165a4e06662c89257f5d2905bb156 (#3903) seems to be the culprit.
PS: As per https://github.com/ggerganov/llama.cpp/pull/2470#issuecomment-1769068705 I am compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0, but the same error happens without that option.
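For reference, a bisect session along these lines should reproduce the finding; the known-good commit is a placeholder, and you rebuild and rerun the failing ./main command at each step:
git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>
# at each step: rebuild, rerun, then mark the commit
make clean && make LLAMA_CUBLAS=1
git bisect good   # or: git bisect bad
git bisect reset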
Mine has been broken since: https://github.com/ggerganov/llama.cpp/pull/2268
First it would crash when loading the model, like setting n_batch too high does, i.e. trying to allocate massive amounts of system RAM. After the memory pool commits, it gives the error above. The memory pool PR does not fix it but at least avoids the crash.
For what it's worth I am seeing this in a fresh build of llama.cpp as well. I am building via the llama_cpp_python package!
(task, pid=12595) ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
(task, pid=12595) ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
(task, pid=12595) ggml_init_cublas: found 4 CUDA devices:
(task, pid=12595) Device 0: Tesla T4, compute capability 7.5
(task, pid=12595) Device 1: Tesla T4, compute capability 7.5
(task, pid=12595) Device 2: Tesla T4, compute capability 7.5
(task, pid=12595) Device 3: Tesla T4, compute capability 7.5
...
(task, pid=12595) CUDA error 1 at /tmp/pip-install-bxeyyykh/llama-cpp-python_262979da943c43fa9967b3c0a61f8580/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
(task, pid=12595) current device: 1
same error with cuda 12.3:
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A10, compute capability 8.6
Device 1: NVIDIA A10, compute capability 8.6
Device 2: NVIDIA A10, compute capability 8.6
Device 3: NVIDIA A10, compute capability 8.6
...
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A10) as main device
llm_load_tensors: mem required = 86.05 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.06 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 487.50 MB
llama_new_context_with_model: kv self size = 487.50 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 282.00 MB
llama_new_context_with_model: VRAM scratch buffer: 275.37 MB
llama_new_context_with_model: total VRAM used: 5569.93 MB (model: 4807.06 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
CUDA error 1 at /tmp/pip-install-5bufkrrh/llama-cpp-python_9a816a9490ba42a78dfd85cdba57cabf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1
Same error here with 2 x T4s using the Python package. It happened to me when redeploying my production Kubernetes environment. I had to quickly downgrade to 1 GPU to get the environment back up. I really do need this fixed ASAP as 1 GPU won't be able to handle load at peak times very well.
Please test changes from https://github.com/ggerganov/llama.cpp/pull/3931. CUDA pools are optional now.
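If anyone wants to test the PR before it is merged, one way to get it locally (assuming your origin remote points at ggerganov/llama.cpp) is to fetch the PR head directly:
git fetch origin pull/3931/head:pr-3931
git checkout pr-3931
make clean && make LLAMA_CUBLAS=1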
After reverting the CUDA pool stuff, it appears to be working again.
I am getting the same error: should I install a specific version? I installed:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
....................................................................................................
llama_new_context_with_model: n_ctx = 3000
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1500.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 10.02 MB
llama_new_context_with_model: VRAM scratch buffer: 3.40 MB
llama_new_context_with_model: total VRAM used: 3170.43 MB (model: 3167.03 MB, context: 3.40 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
time took to retrive documents is 0.6446716785430908
CUDA error 1 at /tmp/pip-install-1ypw1658/llama-cpp-python_1c1bc0be5c7249408c254fa56f97252b/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1
@RachelShalom Please retest with the latest version.
I installed llama cpp a few hours ago and got this error. I assume I installed the latest, unless the fix mentioned here is not in a release yet.
Did you install llama.cpp or llama-cpp-python? I really don't know how quickly llama.cpp propagates to llama-cpp-python.
The Python package, using this:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
and I am using langchain to load the model. I updated langchain and now I have a new error:
CUDA error 222 at /tmp/pip-install-qcfy69x9/llama-cpp-python_d60a2a3fe09943d5b39a16dab77b98a7/vendor/llama.cpp/ggml-cuda.cu:7043: the provided PTX was compiled with an unsupported toolchain.
current device: 0
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
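CUDA error 222 usually means the PTX was built with a newer toolkit than the installed driver supports. A rough way to check and work around it, assuming CMake picks the compiler from CUDACXX (the 11.8 path is only an example; use whichever toolkit matches what nvidia-smi reports):
nvidia-smi   # compare the CUDA version shown here against nvcc --version
CUDACXX=/usr/local/cuda-11.8/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir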
I used llama-cpp-python with langchain, and got the same error:
I installed:
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
and I also upgraded langchain to 0.0.330
Here is the output:
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
...
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llm_load_tensors: mem required = 172.97 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/35 layers to GPU
llm_load_tensors: VRAM used: 3718.38 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 7.18 MB
llama_new_context_with_model: VRAM scratch buffer: 0.55 MB
llama_new_context_with_model: total VRAM used: 3718.93 MB (model: 3718.38 MB, context: 0.55 MB)
CUDA error 1 at /tmp/pip-install-2o911nrr/llama-cpp-python_7b2f2508c89b451280d9116461f3c9cf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1
I have two different cards and they worked well with the compiled llama.cpp, but I got this error when I tried with llama-cpp-python. :(
I'm using llama-cpp-python too, and I just git pull instead of using the author's cherry-picked revision. Sometimes that's good and sometimes that's bad.
Same issue for me on a 2x A100 80GB PCIe setup with https://github.com/ggerganov/llama.cpp/pull/3586. Running with CUDA_VISIBLE_DEVICES=1 works for models which fit. Building with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 doesn't help.
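For anyone comparing, the two workarounds were roughly these (the model path and -ngl value are just the ones from earlier in the thread, and this assumes the Makefile accepts LLAMA_CUDA_PEER_MAX_BATCH_SIZE the same way it accepts LLAMA_CUBLAS):
CUDA_VISIBLE_DEVICES=1 ./main -ngl 83 -m ../transformers_cache/llama-2-70b.Q8_0.gguf -p "Never gonna give"
make clean && make LLAMA_CUBLAS=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0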
My setup works on #3901. Will try to see if I manage to find a commit (e.g. #3903, as suspected in the thread) that breaks it.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Git llama.cpp with python bindings.
Expected Behavior
Inference works like before.
Current Behavior
Inference fails and llama.cpp crashes.
Environment and Context
python 3.10 / cuda 11.8
Failure Information (for bugs)
Relevant Code
I have some printf's for NVLink, as you can see, so the line numbers are a little off, but here is the snippet that sets it off.
One of the args to cudaMemcpyAsync is invalid; I haven't checked yet which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but subsequent commits fixed that up. I waited a little to see if that would happen with this since the code is so new, and I can't access GitHub from that machine, so I have to bring the logs over here.
It does it with both P40s and 3090s and is independent of whether I force MMQ or not.
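To narrow down which argument is bad, it may help to run the failing command under compute-sanitizer (ships with recent CUDA toolkits; cuda-memcheck on older ones). It won't name the exact parameter, but it reports the offending API call and any bad device pointers leading up to it:
compute-sanitizer --tool memcheck ./main -ngl 83 -m ../transformers_cache/llama-2-70b.Q8_0.gguf -p "Never gonna give"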