I ran the latest master on an RTX 2060 SUPER with LLaMA 7B and it looks OK:
19:07:18 master $ make -j && ./bin/main -m ../models/7B/ggml-model-q4_0.bin -ngl 66 -p "Hello, my name is" -n 128
[ 2%] Built target BUILD_INFO
[ 8%] Built target ggml
[ 10%] Built target ggml_static
[ 15%] Built target llama
[ 19%] Built target test-quantize-fns
[ 23%] Built target test-sampling
[ 32%] Built target quantize-stats
[ 32%] Built target test-tokenizer-0
[ 34%] Built target common
[ 39%] Built target test-quantize-perf
[ 43%] Built target quantize
[ 47%] Built target baby-llama
[ 52%] Built target perplexity
[ 56%] Built target benchmark
[ 60%] Built target embedding
[ 65%] Built target train-text-from-scratch
[ 73%] Built target q8dot
[ 78%] Built target main
[ 78%] Built target vdot
[ 84%] Built target server
[ 86%] Built target save-load-state
[ 91%] Built target simple
[ 95%] Built target embdinput
[100%] Built target embd-input-test
main: build = 802 (7242140)
main: seed = 1688746045
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5
llama.cpp: loading model from ../models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0,08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1862,39 MB (+ 1026,00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4892 MB
llama_new_context_with_model: kv self size = 256,00 MB
system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Hello, my name is Katrina (sometimes known as “Jar”). I’m a full-time writer and an aspiring author. I started writing stories at a very young age and it was always my dream to write a book!
I believe every writer has their own unique style. Some writers like to use a pen, others prefer the keyboard. I like to use both when it comes to writing my books! I have a great passion for reading and I hope that this will influence my future as an author. My current interests are science fiction, urban fantasy and horror genres.
I write in my free time on week
llama_print_timings: load time = 1184,95 ms
llama_print_timings: sample time = 85,22 ms / 128 runs ( 0,67 ms per token, 1501,98 tokens per second)
llama_print_timings: prompt eval time = 225,13 ms / 6 tokens ( 37,52 ms per token, 26,65 tokens per second)
llama_print_timings: eval time = 2228,14 ms / 127 runs ( 17,54 ms per token, 57,00 tokens per second)
llama_print_timings: total time = 2569,25 ms
Maybe it is related to the 3B model. Does it work for you with the 7B model?
Much better results. I think you are right; something with the 3B model doesn't work.
E:\LLaMA\llamacpp>main.exe -ngl 40 -p "Hello, my name is" -m E:\LLaMA\models\orca-mini-v2_7b.ggmlv3.q4_0.bin
main: build = 800 (481f793)
main: seed = 1688746676
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from E:\LLaMA\models\orca-mini-v2_7b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1862.39 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4860 MB
llama_new_context_with_model: kv self size = 256.00 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Hello, my name is Nathalie and I am a French native speaker. I offer a personalized language tutoring service to help you learn or improve your French language skills.
I have a Bachelor's degree in Modern Languages (French and Spanish) and a Master's degree in Teaching English as a Foreign Language. During my studies, I gained experience in teaching languages both in France and in the UK.
I am passionate about languages and love helping others learn and improve their language skills. I have experience working with students of all levels and can tailor my lessons to your specific needs and goals.
I use a variety of methods and materials to teach French, including textbooks, online resources, videos, songs, and role-plays. My lessons are dynamic, engaging, and tailored to the student's interests and learning style.
I look forward to helping you improve your French language skills and sharing my passion for this beautiful language! [end of text]
llama_print_timings: load time = 2464.44 ms
llama_print_timings: sample time = 48.42 ms / 199 runs ( 0.24 ms per token, 4109.87 tokens per second)
llama_print_timings: prompt eval time = 417.64 ms / 6 tokens ( 69.61 ms per token, 14.37 tokens per second)
llama_print_timings: eval time = 8845.96 ms / 198 runs ( 44.68 ms per token, 22.38 tokens per second)
llama_print_timings: total time = 9356.55 ms
E:\LLaMA\llamacpp>
I also noticed bad results with the 3B model on a 3080.
My intuition is that it's an issue with padding when converting the vector to q8_1 (see the sketch after the sanitizer output below).
The issue seems to be in the mul mat. Running it with compute-sanitizer (also add -lineinfo to NVCCFLAGS to get line numbers):
========= Invalid __global__ read of size 1 bytes
========= at 0xb10 in /home/slaren/code/llama.cpp/ggml-cuda.cu:1235:vec_dot_q4_0_q8_1(const void *, const block_q8_1 *, int)
========= by thread (24,0,0) in block (0,3199,0)
========= Address 0xdb50d4e02 is out of bounds
========= and is 3 bytes after the nearest allocation at 0xdb4200000 of size 15552000 bytes
========= Device Frame:/home/slaren/code/llama.cpp/ggml-cuda.cu:1392:void mul_mat_vec_q<(int)32, (int)4, block_q4_0, &vec_dot_q4_0_q8_1>(const void *, const void *, float *, int, int) [0xaf0]
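For context, the mul_mat_vec_q path quantizes the activation vector on the fly into 32-element q8_1 blocks and then dot-products whole blocks against the quantized weights. Below is a minimal sketch of the padding idea from the comment above, assuming a quantization kernel shaped roughly like ggml-cuda.cu's quantize_q8_1; the kx_padded parameter and the exact block layout are illustrative assumptions, not the actual patch.

#include <cuda_fp16.h>

#define QK8_1 32  // elements per q8_1 block; equal to the warp size

typedef struct {
    half   d;          // delta (quantization scale)
    half   s;          // sum of the unquantized values
    int8_t qs[QK8_1];  // quantized values
} block_q8_1;

// One thread per element. kx is the real vector length; kx_padded is kx
// rounded up to a multiple of QK8_1, so every block_q8_1 in the output
// buffer is fully written. Because kx_padded is a multiple of the warp
// size, whole warps pass or fail the bounds check together, which keeps
// the __shfl_xor_sync reductions below well defined.
static __global__ void quantize_q8_1(const float * x, block_q8_1 * y,
                                     const int kx, const int kx_padded) {
    const int ix = blockDim.x*blockIdx.x + threadIdx.x;
    if (ix >= kx_padded) {
        return;
    }

    const int ib  = ix / QK8_1; // block index
    const int iqs = ix % QK8_1; // index inside the block

    // elements past the end of the real vector are quantized as zeros
    // instead of leaving uninitialized bytes for the dot-product kernel
    const float xi = ix < kx ? x[ix] : 0.0f;

    float amax = fabsf(xi);
    float sum  = xi;

    // warp-wide reduction; since QK8_1 == 32, one warp covers one block
#pragma unroll
    for (int mask = 16; mask > 0; mask >>= 1) {
        amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, mask, 32));
        sum += __shfl_xor_sync(0xffffffff, sum, mask, 32);
    }

    const float d = amax / 127;
    const int8_t q = amax == 0.0f ? 0 : (int8_t) roundf(xi / d);

    y[ib].qs[iqs] = q;

    if (iqs > 0) {
        return; // only the first lane of each block writes the scalars
    }

    y[ib].d = __float2half(d);
    y[ib].s = __float2half(sum);
}

The allocation backing y would likewise be sized for kx_padded / QK8_1 blocks, so kernels that consume whole trailing blocks stay in bounds.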
OS: Windows 10 LTSC 1809, using the provided CUDA 11.7.1 runtimes from this repo's workflow.
I have an RTX 2060 card, and ever since https://github.com/ggerganov/llama.cpp/pull/2067 was merged, my system generates garbage output with cuBLAS if any GPU layers are offloaded. This does not happen if GGML_CUDA_FORCE_DMMV is set to true, or if 0 layers are offloaded.
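For reference, GGML_CUDA_FORCE_DMMV is a compile-time define in ggml-cuda.cu that forces the older dequantize_mul_mat_vec kernels (dequantize to float, then a float dot product) instead of the integer-dot-product mul_mat_vec_q path that #2067 introduced. A rough sketch of the dispatch it controls; the exact condition is an assumption on my part:

#define MIN_CC_DP4A 610  // dp4a (4x int8 dot product) needs compute capability 6.1

// Sketch only; the real dispatch in ggml-cuda.cu may differ in detail.
static bool use_mul_mat_vec_q(const int compute_capability) {
#ifdef GGML_CUDA_FORCE_DMMV
    (void) compute_capability;
    return false;  // always take the old dequantize_mul_mat_vec path
#else
    return compute_capability >= MIN_CC_DP4A;  // new path from #2067
#endif
}

Since the RTX 2060 is compute capability 7.5, it takes the new path by default, which matches the symptom only appearing without GGML_CUDA_FORCE_DMMV.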
Example output:
another attempt, with fewer layers: