I ran the latest master on an RTX 2060 SUPER with LLaMA 7B and it looks OK:
19:07:18 master $ make -j && ./bin/main -m ../models/7B/ggml-model-q4_0.bin -ngl 66 -p "Hello, my name is" -n 128
[ 2%] Built target BUILD_INFO
[ 8%] Built target ggml
[ 10%] Built target ggml_static
[ 15%] Built target llama
[ 19%] Built target test-quantize-fns
[ 23%] Built target test-sampling
[ 32%] Built target quantize-stats
[ 32%] Built target test-tokenizer-0
[ 34%] Built target common
[ 39%] Built target test-quantize-perf
[ 43%] Built target quantize
[ 47%] Built target baby-llama
[ 52%] Built target perplexity
[ 56%] Built target benchmark
[ 60%] Built target embedding
[ 65%] Built target train-text-from-scratch
[ 73%] Built target q8dot
[ 78%] Built target main
[ 78%] Built target vdot
[ 84%] Built target server
[ 86%] Built target save-load-state
[ 91%] Built target simple
[ 95%] Built target embdinput
[100%] Built target embd-input-test
main: build = 802 (7242140)
main: seed = 1688746045
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5
llama.cpp: loading model from ../models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0,08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1862,39 MB (+ 1026,00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4892 MB
llama_new_context_with_model: kv self size = 256,00 MB
system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Hello, my name is Katrina (sometimes known as “Jar”). I’m a full-time writer and an aspiring author. I started writing stories at a very young age and it was always my dream to write a book!
I believe every writer has their own unique style. Some writers like to use a pen, others prefer the keyboard. I like to use both when it comes to writing my books! I have a great passion for reading and I hope that this will influence my future as an author. My current interests are science fiction, urban fantasy and horror genres.
I write in my free time on week
llama_print_timings: load time = 1184,95 ms
llama_print_timings: sample time = 85,22 ms / 128 runs ( 0,67 ms per token, 1501,98 tokens per second)
llama_print_timings: prompt eval time = 225,13 ms / 6 tokens ( 37,52 ms per token, 26,65 tokens per second)
llama_print_timings: eval time = 2228,14 ms / 127 runs ( 17,54 ms per token, 57,00 tokens per second)
llama_print_timings: total time = 2569,25 ms
Maybe it is related to the 3B model. Does it work for you with the 7B model?
Much better results. I think you are right; something with the 3B model doesn't work.
E:\LLaMA\llamacpp>main.exe -ngl 40 -p "Hello, my name is" -m E:\LLaMA\models\orca-mini-v2_7b.ggmlv3.q4_0.bin
main: build = 800 (481f793)
main: seed = 1688746676
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from E:\LLaMA\models\orca-mini-v2_7b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1862.39 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4860 MB
llama_new_context_with_model: kv self size = 256.00 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Hello, my name is Nathalie and I am a French native speaker. I offer a personalized language tutoring service to help you learn or improve your French language skills.
I have a Bachelor's degree in Modern Languages (French and Spanish) and a Master's degree in Teaching English as a Foreign Language. During my studies, I gained experience in teaching languages both in France and in the UK.
I am passionate about languages and love helping others learn and improve their language skills. I have experience working with students of all levels and can tailor my lessons to your specific needs and goals.
I use a variety of methods and materials to teach French, including textbooks, online resources, videos, songs, and role-plays. My lessons are dynamic, engaging, and tailored to the student's interests and learning style.
I look forward to helping you improve your French language skills and sharing my passion for this beautiful language! [end of text]
llama_print_timings: load time = 2464.44 ms
llama_print_timings: sample time = 48.42 ms / 199 runs ( 0.24 ms per token, 4109.87 tokens per second)
llama_print_timings: prompt eval time = 417.64 ms / 6 tokens ( 69.61 ms per token, 14.37 tokens per second)
llama_print_timings: eval time = 8845.96 ms / 198 runs ( 44.68 ms per token, 22.38 tokens per second)
llama_print_timings: total time = 9356.55 ms
E:\LLaMA\llamacpp>
I also noticed bad results with the 3B model on a 3080.
My intuition is that it's an issue with padding when converting the vector to q8_1 (see the sketch after the sanitizer output below).
The issue seems to be in the mul mat. Running it with compute-sanitizer (also add -lineinfo to NVCCFLAGS to get line numbers):
========= Invalid __global__ read of size 1 bytes
========= at 0xb10 in /home/slaren/code/llama.cpp/ggml-cuda.cu:1235:vec_dot_q4_0_q8_1(const void *, const block_q8_1 *, int)
========= by thread (24,0,0) in block (0,3199,0)
========= Address 0xdb50d4e02 is out of bounds
========= and is 3 bytes after the nearest allocation at 0xdb4200000 of size 15552000 bytes
========= Device Frame:/home/slaren/code/llama.cpp/ggml-cuda.cu:1392:void mul_mat_vec_q<(int)32, (int)4, block_q4_0, &vec_dot_q4_0_q8_1>(const void *, const void *, float *, int, int) [0xaf0]
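For context, the mul_mat_vec_q path quantizes the activation vector on the fly into 32-element q8_1 blocks and then dot-products whole blocks against the quantized weights. Below is a minimal sketch of the padding idea from the comment above, assuming a quantization kernel shaped roughly like ggml-cuda.cu's quantize_q8_1; the kx_padded parameter and the exact block layout are illustrative assumptions, not the actual patch.

#include <cuda_fp16.h>

#define QK8_1 32  // elements per q8_1 block; equal to the warp size

typedef struct {
    half   d;          // delta (quantization scale)
    half   s;          // sum of the unquantized values
    int8_t qs[QK8_1];  // quantized values
} block_q8_1;

// One thread per element. kx is the real vector length; kx_padded is kx
// rounded up to a multiple of QK8_1, so every block_q8_1 in the output
// buffer is fully written. Because kx_padded is a multiple of the warp
// size, whole warps pass or fail the bounds check together, which keeps
// the __shfl_xor_sync reductions below well defined.
static __global__ void quantize_q8_1(const float * x, block_q8_1 * y,
                                     const int kx, const int kx_padded) {
    const int ix = blockDim.x*blockIdx.x + threadIdx.x;
    if (ix >= kx_padded) {
        return;
    }

    const int ib  = ix / QK8_1; // block index
    const int iqs = ix % QK8_1; // index inside the block

    // elements past the end of the real vector are quantized as zeros
    // instead of leaving uninitialized bytes for the dot-product kernel
    const float xi = ix < kx ? x[ix] : 0.0f;

    float amax = fabsf(xi);
    float sum  = xi;

    // warp-wide reduction; since QK8_1 == 32, one warp covers one block
#pragma unroll
    for (int mask = 16; mask > 0; mask >>= 1) {
        amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, mask, 32));
        sum += __shfl_xor_sync(0xffffffff, sum, mask, 32);
    }

    const float d = amax / 127;
    const int8_t q = amax == 0.0f ? 0 : (int8_t) roundf(xi / d);

    y[ib].qs[iqs] = q;

    if (iqs > 0) {
        return; // only the first lane of each block writes the scalars
    }

    y[ib].d = __float2half(d);
    y[ib].s = __float2half(sum);
}

The allocation backing y would likewise be sized for kx_padded / QK8_1 blocks, so kernels that consume whole trailing blocks stay in bounds.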
OS: Windows 10 LTSC 1809, using the provided CUDA 11.7.1 runtimes from this repo's workflow.
I have an RTX 2060 card, and ever since https://github.com/ggerganov/llama.cpp/pull/2067 was merged, my system generates garbage output with cuBLAS if any GPU layers are offloaded. This does not happen if GGML_CUDA_FORCE_DMMV is set to true, or if 0 layers are offloaded.
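For reference, GGML_CUDA_FORCE_DMMV is a compile-time define in ggml-cuda.cu that forces the older dequantize_mul_mat_vec kernels (dequantize to float, then a float dot product) instead of the integer-dot-product mul_mat_vec_q path that #2067 introduced. A rough sketch of the dispatch it controls; the exact condition is an assumption on my part:

#define MIN_CC_DP4A 610  // dp4a (4x int8 dot product) needs compute capability 6.1

// Sketch only; the real dispatch in ggml-cuda.cu may differ in detail.
static bool use_mul_mat_vec_q(const int compute_capability) {
#ifdef GGML_CUDA_FORCE_DMMV
    (void) compute_capability;
    return false;  // always take the old dequantize_mul_mat_vec path
#else
    return compute_capability >= MIN_CC_DP4A;  // new path from #2067
#endif
}

Since the RTX 2060 is compute capability 7.5, it takes the new path by default, which matches the symptom only appearing without GGML_CUDA_FORCE_DMMV.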
Example output:
another attempt, with fewer layers: