Closed: phishmaster closed this issue 5 months ago.
Are you getting correct results when you use the llama.cpp binaries directly without any Python bindings? If not, are you getting correct results when you compile with LLAMA_CUDA_FORCE_MMQ?
Thanks for your response.
First I recompiled llama.cpp with the suggested flag LLAMA_CUDA_FORCE_MMQ, i.e. LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make.
When I do not pass -ngl 40 it seems to give the correct answer:
./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -c 16192 -b 1024 -n 256 --keep 48 \
  --repeat_penalty 5.0 --color -i \
  -r "User:" -f prompts/chat-with-bob.txt
...
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: what is the capital city of France?
Bob: The Capital City Of Paris.<|im_end|>
However, when I ran it with -ngl 40 this is the response:
...
User: what is the capital city of France?
##################################################################################################################
##################################################################################################################
###################
Sorry, I wanted to check this issue but forgot. Did you download a ready-made GGUF file from Huggingface or did you convert it yourself? If it's the former, can you provide a link to the exact file you downloaded?
No problem and thanks for responding. I downloaded the model from Huggingface at
Huy
I cannot reproduce the issue on master. Can you re-download the model and check that this issue isn't due to a corrupted file?
Here is my git master
* 83330d8c - (HEAD -> master, origin/master, origin/HEAD) main : add --conversation / -cnv flag (#7108) (2 hours ago) [Dawid Potocki]
* 465263d0 - sgemm : AVX Q4_0 and Q8_0 (#6891) (2 hours ago) [Eve]
* 911b3900 - server : add_special option for tokenize endpoint (#7059) (4 hours ago) [Johan]
clean and rebuild with
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make
Re-downloaded the model (it also matches my previously downloaded file):
(base) hvu@Kaui:/data/DemoV1/Model4Demo$ md5sum openhermes-2.5-mistral-7b.Q5_K_M_v1.gguf
f7faa7e315e81c3778aae55fcb5fc02c openhermes-2.5-mistral-7b.Q5_K_M_v1.gguf
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
-c 16192 -b 1024 -n 256 --keep 48 \
--repeat_penalty 5.0 --color -i \
-r "User:" -f prompts/chat-with-bob.txt
...
llm_load_print_meta: EOT token = 32000 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1147.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 356
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
...
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:what is the capital city of France?
Bob: The Capital City Of france Is Paris.<|im_end|>
===========================================
I tried various values of -ngl and all of them seem to return garbage:
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -ngl 40
Log start
main: build = 2817 (83330d8c)
...
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 85.94 MiB
llm_load_tensors: CUDA0 buffer size = 2352.25 MiB
llm_load_tensors: CUDA1 buffer size = 2454.81 MiB
...
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:what is the capital city of France?
##################################################################################################################
##################################################################################################################
###################
------------------------------------------------------
Another value, -ngl 16:
...
llm_load_tensors: ggml ctx size = 0.44 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
llm_load_tensors: CUDA0 buffer size = 1160.19 MiB
llm_load_tensors: CUDA1 buffer size = 1192.06 MiB
.....
User:what is the capital city of France?
#<s>▅
$<s>#"
"!<s>
</s>
$
!<s>!!"
"
$#
""<s>
Pretty much the same for -ngl 8
If I remember correctly the output
#<s>▅
$<s>#"
"!<s>
</s>
$
!<s>!!"
"
$#
""<s>
is effectively what you get when a NO_DEVICE_CODE isn't being correctly triggered. My intuition is that this issue is specific to a V100 GPU (and maybe also the CUDA version). If possible, please check the following:
- export CUDA_VISIBLE_DEVICES=0 (makes it so that only the first GPU is used).
- Check that the model has an n_vocab of 32000 (check the console log; the Mistral base model should have that vocab size).
Also: with SXM your V100s are effectively NVLinked, right? Can you check results when compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 and LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999?
Rebuilt with the suggested flags:
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 CUDA_VISIBLE_DEVICES=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make
I think I already have CUDA 12
(llama_cpp_py39) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
Now for the run
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ export CUDA_VISIBLE_DEVICES=0
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -ngl 8
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
llm_load_tensors: CUDA0 buffer size = 1192.06 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1536.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1147.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 268
...
User:what is the capital city of France?
▅</s>"!"!
"
</s></s▅"$#<s># "<s>
Same results
When checking the NVCC version your shell prefix is (llama_cpp_py39). When you actually run the model the prefix is (pytorch_py39_cu11.8). Are you sure that in both cases CUDA 12 is being used?
Also, I didn't mean to compile with both LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 and LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 at the same time. I meant to test either option individually. But if you still get incorrect results with CUDA_VISIBLE_DEVICES=0, that's not going to be the problem anyways.
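For clarity, a minimal sketch of the two separate test builds being asked for (assuming LLAMA_CUDA_PEER_MAX_BATCH_SIZE is passed to make like the other LLAMA_CUDA flags), each followed by a single-GPU run:
# Build A: peer access limited to batch size 0, i.e. effectively disabled
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make
# Build B: peer access allowed for effectively any batch size
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make
# CUDA_VISIBLE_DEVICES is a runtime environment variable, not a build flag:
export CUDA_VISIBLE_DEVICES=0
./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -ngl 40 -n 64 \
  -p "What is the capital city of France?"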
Rebuilt in the base conda environment (default nvcc):
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 CUDA_VISIBLE_DEVICES=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make
....
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.44 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors: CPU buffer size = 4893.00 MiB
llm_load_tensors: CUDA0 buffer size = 588.06 MiB
llm_load_tensors: CUDA1 buffer size = 604.00 MiB
...
Same results
</s>What is the capital city of France?
#"<s> </s>
<s>$
#▅"#
!<s><s>/s><s$<s>" <s>"
llama_print_timings: load time = 2860.74 ms
llama_print_timings: sample time = 70.35 ms / 161 runs ( 0.44 ms per token, 2288.46 tokens per second)
llama_print_timings: prompt eval time = 360.26 ms / 9 tokens ( 40.03 ms per token, 24.98 tokens per second)
llama_print_timings: eval time = 12501.69 ms / 160 runs ( 78.14 ms per token, 12.80 tokens per second)
llama_print_timings: total time = 13594.98 ms / 169 tokens
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$
Do you get any errors with compute-sanitizer? Run compute-sanitizer ./main -m ..
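For reference, a full invocation along these lines (a sketch reusing the model path and flags from the earlier runs; compute-sanitizer defaults to its memcheck tool):
compute-sanitizer ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "What is the capital city of France?" -n 32 -ngl 40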
No errors, and the "compute-sanitizer" didn't seem to help; however, it seems to work better if I use -ngl -1 instead of any specific values. Does that help?
-ngl -1 is effectively the same as -ngl 0.
What driver version are you using? Run nvidia-smi.
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvidia-smi
Wed May 8 13:37:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:1D:00.0 Off | Off |
| N/A 40C P0 54W / 300W | 1673MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-SXM2-16GB Off | 00000000:1E:00.0 Off | Off |
| N/A 46C P0 54W / 300W | 623MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1557 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2365880 C ...envs/pytorch_py39_cu11.8/bin/python 584MiB |
| 0 N/A N/A 2665195 C ...envs/pytorch_py39_cu11.8/bin/python 498MiB |
| 0 N/A N/A 2683103 C ...envs/pytorch_py39_cu11.8/bin/python 584MiB |
| 1 N/A N/A 1557 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2365880 C ...envs/pytorch_py39_cu11.8/bin/python 308MiB |
| 1 N/A N/A 2683103 C ...envs/pytorch_py39_cu11.8/bin/python 308MiB |
+-----------------------------------------------------------------------------------------+
According to the HuggingFace repository, the model was made with llama.cpp revision 629f917. Do you get correct results with that revision?
Here is my current repo.
(base) hvu@Kaui:~/Demo_v1/llama.cpp_629f917$ ./main -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -p "What is the capital city of France?" -ngl 40
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 1477 (629f917c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715190618
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf (version GGUF V3 (latest))
...
llama_new_context_with_model: n_ctx = 16192
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 2024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 2141.88 MB
...
What is the capital city of France? Paris. The French government has its seat in Paris, which also serves as an important center for culture and business within Europe due to being one if not THE most famous cities globally known across different industries such arts or fashion among others that contribute significantly towards global economy with their unique products like haute couture designs from Christian Dior house based right here! Paris is located in the northern part of France, near Normandy and Brittany. It has a population over 2 million people making it one if not THE largest cities globally known across different industries such arts or fashion among others that contribute significantly towards global economy with their unique products like haute couture designs from Christian Dior house based right here! Paris is often referred to as the “City of Love”, and for good reason. The city boasts some amazing architecture, including Notre-Dame Cathedral which has been featured in countless films; it’s also home base during fashion weeks where designers showcase their latest collections on runways around town! Paris was founded by Celtic tribes known as Parisii back before Christ when they settled along the Seine River. Later conquered and ruled successively through Roman, Frankish (Merovingian), Carolingians
Recompiled with the right option to enable CUDA, and same problem:
(base) hvu@Kaui:~/Demo_v1/llama.cpp_629f917$ ./main -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -p "What is the capital city of France?" -ngl 40
Log start
main: build = 1477 (629f917c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715191104
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf (version GGUF V3 (latest))
...
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 86.05 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.06 MB
...
What is the capital city of France?###############################################################################
##################################################################################################################
###############################################################
Experimenting with various -ngl values, keeping it below about 20 seems to help for many models; at a certain point it just flips from working to garbage. In this experiment the model llama-2-13b/ggml-model-q5_K_M.bin works with -ngl at 22 or below.
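A quick way to locate the flip point (a sketch reusing the model path and a prompt already used in this thread, and assuming the log output goes to stderr as usual so only the completion remains on stdout) is to sweep -ngl and eyeball each answer:
for ngl in 8 16 20 22 23 24 32 40; do
  echo "=== -ngl $ngl ==="
  ./main_gpu -m /data/llama2/llama-2-13b/ggml-model-q5_K_M.bin \
    -p "What is the capital city of France?" -n 32 -ngl $ngl 2>/dev/null
done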
Here is an example of it working with the Python bindings:
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.55 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/41 layers to GPU
llm_load_tensors: CPU buffer size = 5362.94 MiB
llm_load_tensors: CUDA0 buffer size = 1700.92 MiB
llm_load_tensors: CUDA1 buffer size = 1737.77 MiB
...
output = llm(
    "Q: Name all the planets in the solar system? A:", # Prompt
    max_tokens=256, # Generate up to 256 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
) # Generate a completion; can also call create_completion
print(output)
...
{'id': 'cmpl-6ddf6146-082d-40ed-9188-7acea7ee3f6d', 'object': 'text_completion', 'created': 1715260091, 'model': '/data/llama2/llama-2-13b/ggml-model-q5_K_M.bin', 'choices': [{'text': 'Q: Name all the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 24, 'total_tokens': 39}}
With 16 layers offloaded, only about 2.2 GB x 2 of VRAM was used; with 22 layers, about 3 GB x 2, and it still works.
At 23 layers the answer comes back garbage, with about 3 GB x 2 of VRAM usage, which is well below the 16 GB x 2 available.
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.55 MiB
llm_load_tensors: offloading 23 repeating layers to GPU
llm_load_tensors: offloaded 23/41 layers to GPU
llm_load_tensors: CPU buffer size = 3882.31 MiB
llm_load_tensors: CUDA0 buffer size = 2545.23 MiB
llm_load_tensors: CUDA1 buffer size = 2374.08 MiB
...
output = llm(
    "Q: Name all the planets in the solar system? A:", # Prompt
    max_tokens=256, # Generate up to 256 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
) # Generate a completion; can also call create_completion
print(output)
...
{'id': 'cmpl-9fc9e4b3-b052-401a-8984-67b72fbe5b31', 'object': 'text_completion', 'created': 1715260540, 'model': '/data/llama2/llama-2-13b/ggml-model-q5_K_M.bin', 'choices': [{'text': 'Q: Name all the planets in the solar system? A: 23,495', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 8, 'total_tokens': 23}}
If the user goes beyond the supported value, shouldn't there be a warning or an error? Also, is there a deterministic way of knowing what value of -ngl will work versus when it will return garbage?
It should work with any value. You could try running the eval-callback example with CPU and with full offload and try to see what's the first operation that produces significantly different values.
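One way to set up that comparison (a sketch, assuming the eval-callback example accepts the usual -m/--prompt/-ngl options, as in the command shown later in this thread):
./eval-callback -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf --prompt hello -ngl 0  > eval-cpu.log 2>&1
./eval-callback -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf --prompt hello -ngl 99 > eval-gpu.log 2>&1
# The first diverging ggml_debug block points at the first suspect operation
diff eval-cpu.log eval-gpu.log | head -n 60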
The eval-callback logs are attached: eval-callback_gpu_f16.log, eval-callback_cpu.log, eval-callback.log.
- gpu_f16: uses the f16 model instead of the quantized one
- eval_callback.log: uses the Q5 model
- cpu: run with -ngl 0
Sorry, eval-callback was broken and the numbers are useless. Please try again with #7184 or after it is merged.
Pulled from master and reran eval-callback. Logs are attached: eval-callback_q5km_gpu.log, eval-callback_q5km_cpu.log, eval-callback_f16_gpu.log.
Additionally, for version 4f02636, -ngl 20 and below seems to work fine; anything above that and the results are garbage:
<s> What is the capital city of France?
▅! "#
<s</s>
##$</s>▅ ▅
</s>
" "</s> ▅
Can you share the full command line that you used to generate the eval-callback logs? What f16 model did you use?
It seems to break down at the end of layer 11. Did you try enabling ECC? (nvidia-smi -e 1)
Also try using the environment variable CUDA_LAUNCH_BLOCKING=1.
ggml_debug: ffn_gate_par-11 = (f32) MUL(ffn_silu-11{13824, 2, 1, 1}, ffn_up-11{13824, 2, 1, 1}}) = {13824, 2, 1, 1}
[
[
[ -0.0039, 0.0012, -0.0001, ..., 0.0058, -0.0061, 0.0020],
[ -0.0036, -0.0129, -0.0303, ..., 0.0178, -0.0390, 0.0095],
],
]
sum = -0.059607
ggml_debug: ffn_out-11 = (f32) MUL_MAT(blk.11.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-11{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
],
]
sum = nan
ggml_debug: l_out-11 = (f32) ADD(ffn_out-11{5120, 2, 1, 1}, ffn_inp-11{5120, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
],
]
sum = nan
ggml_debug: norm-12 = (f32) RMS_NORM(l_out-11{5120, 2, 1, 1}, }) = {5120, 2, 1, 1}
[
[
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
],
]
sum = nan
My guess is that this is a hardware failure of some sort. Are you using a custom build for these V100s that might not provide enough power or cooling?
I highly doubt that power or cooling is the source; mainly, that would imply a lot more randomness, versus the very deterministic failures at ~20 layers of offload.
As for cooling, the server is housed in a rack and air conditioned.
Let me try enabling ECC and send results.
root@Kaui:/data# nvidia-smi -e 1
Enabled ECC support for GPU 00000000:1D:00.0.
Enabled ECC support for GPU 00000000:1E:00.0.
All done.
Reboot required
It's not likely to be an incompatibility with the GPU architecture; in fact, the ggml-ci tests every commit on master on a PCIe V100. Whatever the issue is, it seems to be specific to your system. I know that some people have been trying to use V100s in custom builds since they are relatively cheap when bought used, and if this is the case here, I think the most likely cause is some issue with the build.
We don't have anything "custom" that I am aware of. Pretty much standard server with 2 V100 GPUs. As for software it is Ubuntu 22 LTS and pre-built drivers.
Going to ask our IT folks to run a complete VRAM diagnostic also.
My command line:
./eval-callback -m models/Mistral-7B-v01/ggml-model-f16.gguf --prompt hello --seed 1023
The f16 model is from the command:
python3 convert.py models/llama-2-13b --outtype f16
The Q5 version is from doing quantization:
./quantize models/Mistral-7B-v0.1/ggml-model-f16.gguf models/Mistral-7B-v0.1/ggml-model-q5_K_M.bin q5_K_M
I figured that you used -ngl 33 with llama-2-13b f16, and tried to reproduce the eval-callback result. The first significant difference I see is this:
ggml_debug: ffn_out-7 = (f32) MUL_MAT(blk.7.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-7{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ -0.2148, -0.5171, 0.0000, ..., 0.1681, -0.2013, -0.0247],
[ 0.0996, -0.0289, -0.1234, ..., -0.1000, 0.1619, -0.0583],
],
]
sum = -0.838913
ggml_debug: ffn_out-7 = (f32) MUL_MAT(blk.7.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-7{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
[
[
[ -0.2151, -0.5142, -0.0017, ..., 0.1688, -0.2012, -0.0216],
[ 0.0499, -0.0531, -0.1272, ..., -0.0833, 0.1644, -0.0731],
],
]
With each matrix multiplication, results get progressively worse, until eventually it produces only nan in a matrix multiplication. I can only explain this with either data corruption or hardware error.
Running pytorch-gpu-benchmark, we are above 6 GB of VRAM usage (way more than the ~2 GB for the test above) and it's humming right along without any issues so far. I will post the final benchmark result when it completes.
Maybe the C/C++ equivalent of torch.cuda.synchronize() is missing somewhere?
You can test for that by using the CUDA_LAUNCH_BLOCKING=1 env variable.
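For example (a sketch reusing the earlier command line), the variable only needs to be set for that one process:
CUDA_LAUNCH_BLOCKING=1 ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "What is the capital city of France?" -n 64 -ngl 40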
Is there a way to force single-GPU usage? The PyTorch benchmark seems to run fine on one GPU but has issues when dual GPUs are used.
Using the CUDA_LAUNCH_BLOCKING=1 env variable yielded the same results.
Thank you for your help. After running GPU VRAM tests, we found that there may indeed be hardware issues.
...
[05/17/2024 13:59:52][Kaui][0]:ERROR: 7th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 8th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 9th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.54.15 Tue Mar 5 22:23:56 UTC 2024
[05/17/2024 13:59:52][Kaui][0]:ERROR: The unit serial number is 0320918003682
[05/17/2024 13:59:52][Kaui][0]:ERROR: (move_inv_read) 16504 errors found in block 3200
[05/17/2024 13:59:52][Kaui][0]:ERROR: the last 10 error addresses are: 0x76d48b5fcbec 0x76d48b5fcbf4 0x76d48b5fcbfc 0x76d48b1fe110 0x76d48b5fcb5c 0x76d48b5fcba4 0x76d48b5fcbac 0x76d48b5fcbb4 0x76d48b5fcbbc 0x76d48b5fcbe4
[05/17/2024 13:59:52][Kaui][0]:ERROR: 0th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 1th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 2th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 3th error, expected value=0x850f3f1b, current value=0x850f3f1f, diff=0x4 (second_read=0x850f3f1f, expect=0x850f3f1b, diff with expected value=0x4)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 4th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 5th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 6th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 7th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 8th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 9th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
I am running llama_cpp version 0.2.68 on Ubuntu 22.04 LTS under a conda environment. Attached are two Jupyter notebooks with ONLY one line changed (use CPU vs GPU). As you can see, under the exact same environmental conditions, switching between CPU and GPU gives vastly different answers, and the GPU output is completely wrong. I would appreciate some pointers on how to debug this.
The only significant difference between the two files is this one-liner:
#n_gpu_layers=-1, # Uncomment to use GPU acceleration
The model used was openhermes-2.5-mistral-7b.Q5_K_M.gguf
mistral_llama_large-gpu.pdf mistral_llama_large-cpu.pdf
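For context, a minimal sketch of that one-line difference between the two notebooks, assuming the standard llama-cpp-python Llama constructor (the n_ctx value is hypothetical, chosen here only to match the -c value used on the command line; the rest of the notebook code is not shown in this thread):
from llama_cpp import Llama

llm = Llama(
    model_path="/data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf",
    n_ctx=16192,          # hypothetical; mirrors the -c 16192 used in the CLI runs above
    # n_gpu_layers=-1,    # Uncomment to use GPU acceleration (-1 offloads all layers)
)

output = llm(
    "Q: Name all the planets in the solar system? A:",
    max_tokens=256,
    stop=["Q:", "\n"],
    echo=True,
)
print(output)
With the n_gpu_layers line commented out the model runs entirely on the CPU; uncommenting it is the single change between the CPU and GPU notebooks.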