Closed: phishmaster closed this issue 5 months ago.
Are you getting correct results when you use the llama.cpp binaries directly without any Python bindings? If not, are you getting correct results when you compile with LLAMA_CUDA_FORCE_MMQ?
Thanks for your response.
First I recompiled llama.cpp with the suggested flag LLAMA_CUDA_FORCE_MMQ, i.e. LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make.
When I do not pass -ngl 40 it seems to give the correct answer:
./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -c 16192 -b 1024 -n 256 --keep 48 \
  --repeat_penalty 5.0 --color -i \
  -r "User:" -f prompts/chat-with-bob.txt
...
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: what is the capital city of France?
Bob: The Capital City Of Paris.<|im_end|>
However, when I ran it with -ngl 40 this is the response:
...
User: what is the capital city of France?
##################################################################################################################
##################################################################################################################
###################
Sorry, I wanted to check this issue but forgot. Did you download a ready-made GGUF file from Huggingface or did you convert it yourself? If it's the former, can you provide a link to the exact file you downloaded?
No problem and thanks for responding. I downloaded the model from Huggingface at
Huy
I cannot reproduce the issue on master. Can you re-download the model and check that this issue isn't due to a corrupted file?
Here is my git master
* 83330d8c - (HEAD -> master, origin/master, origin/HEAD) main : add --conversation / -cnv flag (#7108) (2 hours ago) [Dawid Potocki]
* 465263d0 - sgemm : AVX Q4_0 and Q8_0 (#6891) (2 hours ago) [Eve]
* 911b3900 - server : add_special option for tokenize endpoint (#7059) (4 hours ago) [Johan]
clean and rebuild with
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make
Re-downloaded the model (it also matches my previously downloaded file):
(base) hvu@Kaui:/data/DemoV1/Model4Demo$ md5sum openhermes-2.5-mistral-7b.Q5_K_M_v1.gguf
f7faa7e315e81c3778aae55fcb5fc02c openhermes-2.5-mistral-7b.Q5_K_M_v1.gguf
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
-c 16192 -b 1024 -n 256 --keep 48 \
--repeat_penalty 5.0 --color -i \
-r "User:" -f prompts/chat-with-bob.txt
...
llm_load_print_meta: EOT token = 32000 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1147.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 356
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
...
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:what is the capital city of France?
Bob: The Capital City Of france Is Paris.<|im_end|>
===========================================
I tried various values of -ngl and all of them seem to return garbage:
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -ngl 40
Log start
main: build = 2817 (83330d8c)
...
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 85.94 MiB
llm_load_tensors: CUDA0 buffer size = 2352.25 MiB
llm_load_tensors: CUDA1 buffer size = 2454.81 MiB
...
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:what is the capital city of France?
##################################################################################################################
##################################################################################################################
###################
------------------------------------------------------
Another value, -ngl 16:
...
llm_load_tensors: ggml ctx size = 0.44 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
llm_load_tensors: CUDA0 buffer size = 1160.19 MiB
llm_load_tensors: CUDA1 buffer size = 1192.06 MiB
.....
User:what is the capital city of France?
#<s>▅
$<s>#"
"!<s>
</s>
$
!<s>!!"
"
$#
""<s>
Pretty much the same for -ngl 8
If I remember correctly the output
#<s>▅
$<s>#"
"!<s>
</s>
$
!<s>!!"
"
$#
""<s>
is effectively what you get when a NO_DEVICE_CODE isn't being correctly triggered. My intuition is that this issue is specific to a V100 GPU (and maybe also the CUDA version). If possible, please check the following:
- export CUDA_VISIBLE_DEVICES=0 (makes it so that only the first GPU is used).
- Check that the model has an n_vocab of 32000 (check the console log; the Mistral base model should have that vocab size).
Also: with SXM your V100s are effectively NVLinked, right? Can you check results when compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 and LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999?
Rebuilt with the suggested flags:
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 CUDA_VISIBLE_DEVICES=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make
I think I already have CUDA 12
(llama_cpp_py39) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
Now for the run
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ export CUDA_VISIBLE_DEVICES=0
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -ngl 8
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
llm_load_tensors: CUDA0 buffer size = 1192.06 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1536.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1147.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 268
...
User:what is the capital city of France?
▅</s>"!"!
"
</s></s▅"$#<s># "<s>
Same results
When checking the NVCC version your shell prefix is (llama_cpp_py39). When you actually run the model the prefix is (pytorch_py39_cu11.8). Are you sure that in both cases CUDA 12 is being used?
Also, I didn't mean to compile with both LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 and LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 at the same time. I meant to test either option individually. But if you still get incorrect results with CUDA_VISIBLE_DEVICES=0, that's not going to be the problem anyways.
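For clarity, a minimal sketch of the two separate test builds being asked for (assuming LLAMA_CUDA_PEER_MAX_BATCH_SIZE is passed to make like the other LLAMA_CUDA flags), each followed by a single-GPU run:
# Build A: peer access limited to batch size 0, i.e. effectively disabled
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make
# Build B: peer access allowed for effectively any batch size
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make
# CUDA_VISIBLE_DEVICES is a runtime environment variable, not a build flag:
export CUDA_VISIBLE_DEVICES=0
./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -ngl 40 -n 64 \
  -p "What is the capital city of France?"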
Rebuilt in the base conda environment (default nvcc):
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 CUDA_VISIBLE_DEVICES=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make
....
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.44 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
llm_load_tensors: CUDA0 buffer size = 588.06 MiB
llm_load_tensors: CUDA1 buffer size = 604.00 MiB
...
Same results
</s>What is the capital city of France?
#"<s> </s>
<s>$
#▅"#
!<s><s>/s><s$<s>" <s>"
llama_print_timings: load time = 2860.74 ms
llama_print_timings: sample time = 70.35 ms / 161 runs ( 0.44 ms per token, 2288.46 tokens per second)
llama_print_timings: prompt eval time = 360.26 ms / 9 tokens ( 40.03 ms per token, 24.98 tokens per second)
llama_print_timings: eval time = 12501.69 ms / 160 runs ( 78.14 ms per token, 12.80 tokens per second)
llama_print_timings: total time = 13594.98 ms / 169 tokens
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$
Do you get any errors with compute-sanitizer? Run compute-sanitizer ./main -m ..
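For reference, a full invocation along these lines (a sketch reusing the model path and flags from the earlier runs; compute-sanitizer defaults to its memcheck tool):
compute-sanitizer ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "What is the capital city of France?" -n 32 -ngl 40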
No errors, and the "compute-sanitizer" didn't seem to help; however, it seems to work better if I use -ngl -1 instead of any specific values. Does that help?
-ngl -1 is effectively the same as -ngl 0.
What driver version are you using? Run nvidia-smi.
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvidia-smi
Wed May 8 13:37:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:1D:00.0 Off | Off |
| N/A 40C P0 54W / 300W | 1673MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-SXM2-16GB Off | 00000000:1E:00.0 Off | Off |
| N/A 46C P0 54W / 300W | 623MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1557 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2365880 C ...envs/pytorch_py39_cu11.8/bin/python 584MiB |
| 0 N/A N/A 2665195 C ...envs/pytorch_py39_cu11.8/bin/python 498MiB |
| 0 N/A N/A 2683103 C ...envs/pytorch_py39_cu11.8/bin/python 584MiB |
| 1 N/A N/A 1557 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2365880 C ...envs/pytorch_py39_cu11.8/bin/python 308MiB |
| 1 N/A N/A 2683103 C ...envs/pytorch_py39_cu11.8/bin/python 308MiB |
+-----------------------------------------------------------------------------------------+
According to the HuggingFace repository, the model was made with llama.cpp revision 629f917. Do you get correct results with that revision?
Here is my current repo.
(base) hvu@Kaui:~/Demo_v1/llama.cpp_629f917$ ./main -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -p "What is the capital city of France?" -ngl 40
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 1477 (629f917c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715190618
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf (version GGUF V3 (latest))
...
llama_new_context_with_model: n_ctx = 16192
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 2024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 2141.88 MB
...
What is the capital city of France? Paris. The French government has its seat in Paris, which also serves as an important center for culture and business within Europe due to being one if not THE most famous cities globally known across different industries such arts or fashion among others that contribute significantly towards global economy with their unique products like haute couture designs from Christian Dior house based right here! Paris is located in the northern part of France, near Normandy and Brittany. It has a population over 2 million people making it one if not THE largest cities globally known across different industries such arts or fashion among others that contribute significantly towards global economy with their unique products like haute couture designs from Christian Dior house based right here! Paris is often referred to as the “City of Love”, and for good reason. The city boasts some amazing architecture, including Notre-Dame Cathedral which has been featured in countless films; it’s also home base during fashion weeks where designers showcase their latest collections on runways around town! Paris was founded by Celtic tribes known as Parisii back before Christ when they settled along the Seine River. Later conquered and ruled successively through Roman, Frankish (Merovingian), Carolingians
Recompiled with the right option to enable CUDA, and same problem:
(base) hvu@Kaui:~/Demo_v1/llama.cpp_629f917$ ./main -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -p "What is the capital city of France?" -ngl 40
Log start
main: build = 1477 (629f917c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715191104
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf (version GGUF V3 (latest))
...
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 86.05 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.06 MB
...
What is the capital city of France?###############################################################################
##################################################################################################################
###############################################################
Experimenting with various -ngl values, keeping it below about 20 seems to help for many models; at a certain point it just flips from working to garbage. In this experiment the model llama-2-13b/ggml-model-q5_K_M.bin works with -ngl at 22 or below.
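A quick way to locate the flip point (a sketch reusing the model path and a prompt already used in this thread, and assuming the log output goes to stderr as usual so only the completion remains on stdout) is to sweep -ngl and eyeball each answer:
for ngl in 8 16 20 22 23 24 32 40; do
  echo "=== -ngl $ngl ==="
  ./main_gpu -m /data/llama2/llama-2-13b/ggml-model-q5_K_M.bin \
    -p "What is the capital city of France?" -n 32 -ngl $ngl 2>/dev/null
done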
Here is an example of it working with the Python bindings:
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.55 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/41 layers to GPU
llm_load_tensors: CPU buffer size = 5362.94 MiB
llm_load_tensors: CUDA0 buffer size = 1700.92 MiB
llm_load_tensors: CUDA1 buffer size = 1737.77 MiB
...
output = llm(
    "Q: Name all the planets in the solar system? A:", # Prompt
    max_tokens=256, # Generate up to 256 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
) # Generate a completion; can also call create_completion
print(output)
...
{'id': 'cmpl-6ddf6146-082d-40ed-9188-7acea7ee3f6d', 'object': 'text_completion', 'created': 1715260091, 'model': '/data/llama2/llama-2-13b/ggml-model-q5_K_M.bin', 'choices': [{'text': 'Q: Name all the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 24, 'total_tokens': 39}}
With 16 layers offloaded, only about 2.2 GB x 2 of VRAM was used; with 22 layers, about 3 GB x 2, and it still works.
At 23 layers the answer comes back garbage, with about 3 GB x 2 of VRAM usage, which is well below the 16 GB x 2 available.
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.55 MiB
llm_load_tensors: offloading 23 repeating layers to GPU
llm_load_tensors: offloaded 23/41 layers to GPU
llm_load_tensors: CPU buffer size = 3882.31 MiB
llm_load_tensors: CUDA0 buffer size = 2545.23 MiB
llm_load_tensors: CUDA1 buffer size = 2374.08 MiB
...
output = llm(
    "Q: Name all the planets in the solar system? A:", # Prompt
    max_tokens=256, # Generate up to 256 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
) # Generate a completion; can also call create_completion
print(output)
...
{'id': 'cmpl-9fc9e4b3-b052-401a-8984-67b72fbe5b31', 'object': 'text_completion', 'created': 1715260540, 'model': '/data/llama2/llama-2-13b/ggml-model-q5_K_M.bin', 'choices': [{'text': 'Q: Name all the planets in the solar system? A: 23,495', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 8, 'total_tokens': 23}}
If the user goes beyond the supported value, shouldn't there be a warning or an error? Also, is there a deterministic way of knowing what value of -ngl will work versus when it will return garbage?
It should work with any value. You could try running the eval-callback example with CPU and with full offload and try to see what's the first operation that produces significantly different values.
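One way to set up that comparison (a sketch, assuming the eval-callback example accepts the usual -m/--prompt/-ngl options, as in the command shown later in this thread):
./eval-callback -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf --prompt hello -ngl 0  > eval-cpu.log 2>&1
./eval-callback -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf --prompt hello -ngl 99 > eval-gpu.log 2>&1
# The first diverging ggml_debug block points at the first suspect operation
diff eval-cpu.log eval-gpu.log | head -n 60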
The eval-callback logs are attached: eval-callback_gpu_f16.log, eval-callback_cpu.log, eval-callback.log.
- gpu_f16: uses the f16 model instead of the quantized one
- eval_callback.log: uses the Q5 model
- cpu: run with -ngl 0
Sorry, eval-callback was broken and the numbers are useless. Please try again with #7184 or after it is merged.
Pulled from master and reran eval-callback. Logs are attached: eval-callback_q5km_gpu.log, eval-callback_q5km_cpu.log, eval-callback_f16_gpu.log.
Additionally, for version 4f02636, -ngl 20 and below seems to work fine; anything above that and the results are garbage:
<s> What is the capital city of France?
▅! "#
<s</s>
##$</s>▅ ▅
</s>
" "</s> ▅
Can you share the full command line that you used to generate the eval-callback logs? What f16 model did you use?
It seems to break down at the end of layer 11. Did you try enabling ECC? (nvidia-smi -e 1)
Also try using the environment variable CUDA_LAUNCH_BLOCKING=1.
ggml_debug: ffn_gate_par-11 = (f32) MUL(ffn_silu-11{13824, 2, 1, 1}, ffn_up-11{13824, 2, 1, 1}}) = {13824, 2, 1, 1}
[
[
[ -0.0039, 0.0012, -0.0001, ..., 0.0058, -0.0061, 0.0020],
[ -0.0036, -0.0129, -0.0303, ..., 0.0178, -0.0390, 0.0095],
],
]
sum = -0.059607
ggml_debug: ffn_out-11 = (f32) MUL_MAT(blk.11.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-11{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
],
]
sum = nan
ggml_debug: l_out-11 = (f32) ADD(ffn_out-11{5120, 2, 1, 1}, ffn_inp-11{5120, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
],
]
sum = nan
ggml_debug: norm-12 = (f32) RMS_NORM(l_out-11{5120, 2, 1, 1}, }) = {5120, 2, 1, 1}
[
[
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
],
]
sum = nan
My guess is that this is a hardware failure of some sort. Are you using a custom build for these V100s that might not provide enough power or cooling?
I highly doubt that power or cooling is the source; mainly, that would imply a lot more randomness, versus the very deterministic failures at ~20 layers of offload.
As for cooling, the server is housed in a rack and air conditioned.
Let me try enabling ECC and send results.
root@Kaui:/data# nvidia-smi -e 1
Enabled ECC support for GPU 00000000:1D:00.0.
Enabled ECC support for GPU 00000000:1E:00.0.
All done.
Reboot required
It's not likely to be an incompatibility with the GPU architecture; in fact, the ggml-ci tests every commit on master on a PCIe V100. Whatever the issue is, it seems to be specific to your system. I know that some people have been trying to use V100s in custom builds since they are relatively cheap when bought used, and if this is the case here, I think the most likely cause is some issue with the build.
We don't have anything "custom" that I am aware of. Pretty much standard server with 2 V100 GPUs. As for software it is Ubuntu 22 LTS and pre-built drivers.
Going to ask our IT folks to run a complete VRAM diagnostic also.
My command line:
./eval-callback -m models/Mistral-7B-v01/ggml-model-f16.gguf --prompt hello --seed 1023
The f16 model is from the command:
python3 convert.py models/llama-2-13b --outtype f16
The Q5 version is from doing quantization:
./quantize models/Mistral-7B-v0.1/ggml-model-f16.gguf models/Mistral-7B-v0.1/ggml-model-q5_K_M.bin q5_K_M
I figured that you used -ngl 33 with llama-2-13b f16, and tried to reproduce the eval-callback result. The first significant difference I see is this:
ggml_debug: ffn_out-7 = (f32) MUL_MAT(blk.7.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-7{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ -0.2148, -0.5171, 0.0000, ..., 0.1681, -0.2013, -0.0247],
[ 0.0996, -0.0289, -0.1234, ..., -0.1000, 0.1619, -0.0583],
],
]
sum = -0.838913
ggml_debug: ffn_out-7 = (f32) MUL_MAT(blk.7.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-7{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ -0.2151, -0.5142, -0.0017, ..., 0.1688, -0.2012, -0.0216],
[ 0.0499, -0.0531, -0.1272, ..., -0.0833, 0.1644, -0.0731],
],
]
With each matrix multiplication, results get progressively worse, until eventually it produces only nan in a matrix multiplication. I can only explain this with either data corruption or hardware error.
Running pytorch-gpu-benchmark, we are above 6 GB of VRAM usage (way more than the ~2 GB for the test above) and it's humming right along without any issues so far. I will post the final benchmark result when it completes.
Maybe the C/C++ equivalent of torch.cuda.synchronize() is missing somewhere?
You can test for that by using the CUDA_LAUNCH_BLOCKING=1 env variable.
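For example (a sketch reusing the earlier command line), the variable only needs to be set for that one process:
CUDA_LAUNCH_BLOCKING=1 ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "What is the capital city of France?" -n 64 -ngl 40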
Is there a way to force single-GPU usage? The PyTorch benchmark seems to run fine on one GPU but has issues when dual GPUs are used.
Using the CUDA_LAUNCH_BLOCKING=1 env variable yielded the same results.
Thank you for your help. After running GPU VRAM tests, we found that there may indeed be hardware issues.
...
[05/17/2024 13:59:52][Kaui][0]:ERROR: 7th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 8th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 9th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.54.15 Tue Mar 5 22:23:56 UTC 2024
[05/17/2024 13:59:52][Kaui][0]:ERROR: The unit serial number is 0320918003682
[05/17/2024 13:59:52][Kaui][0]:ERROR: (move_inv_read) 16504 errors found in block 3200
[05/17/2024 13:59:52][Kaui][0]:ERROR: the last 10 error addresses are: 0x76d48b5fcbec 0x76d48b5fcbf4 0x76d48b5fcbfc 0x76d48b1fe110 0x76d48b5fcb5c 0x76d48b5fcba4 0x76d48b5fcbac 0x76d48b5fcbb4 0x76d48b5fcbbc 0x76d48b5fcbe4
[05/17/2024 13:59:52][Kaui][0]:ERROR: 0th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 1th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 2th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 3th error, expected value=0x850f3f1b, current value=0x850f3f1f, diff=0x4 (second_read=0x850f3f1f, expect=0x850f3f1b, diff with expected value=0x4)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 4th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 5th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 6th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 7th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 8th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 9th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
I am running llama_cpp version 0.2.68 on Ubuntu 22.04 LTS under a conda environment. Attached are two Jupyter notebooks with ONLY one line changed (use CPU vs GPU). As you can see, under the exact same environmental conditions, switching between CPU and GPU gives vastly different answers, and the GPU output is completely wrong. I would appreciate some pointers on how to debug this.
The only significant difference between the two files is this one-liner:
#n_gpu_layers=-1, # Uncomment to use GPU acceleration
The model used was openhermes-2.5-mistral-7b.Q5_K_M.gguf
mistral_llama_large-gpu.pdf mistral_llama_large-cpu.pdf
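For context, a minimal sketch of that one-line difference between the two notebooks, assuming the standard llama-cpp-python Llama constructor (the n_ctx value is hypothetical, chosen here only to match the -c value used on the command line; the rest of the notebook code is not shown in this thread):
from llama_cpp import Llama

llm = Llama(
    model_path="/data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf",
    n_ctx=16192,          # hypothetical; mirrors the -c 16192 used in the CLI runs above
    # n_gpu_layers=-1,    # Uncomment to use GPU acceleration (-1 offloads all layers)
)

output = llm(
    "Q: Name all the planets in the solar system? A:",
    max_tokens=256,
    stop=["Q:", "\n"],
    echo=True,
)
print(output)
With the n_gpu_layers line commented out the model runs entirely on the CPU; uncommenting it is the single change between the CPU and GPU notebooks.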