ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Generation with cuBLAS not deterministic for long prompts #1340

Closed JohannesGaessler closed 1 year ago

JohannesGaessler commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

When I set a seed and repeat a generation with the exact same parameters I expect to get the exact same text again.

Current Behavior

When I re-run a generation with the same seed and parameters, the generated text is not always the same between runs; it is sometimes identical, but not always.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

* git commit: 173d0e6419e8f8f3c1f4f13201b777f4c60629f3

* Physical (or virtual) hardware you are using, e.g. for Linux: `$ lscpu`

```
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          43 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen 7 3700X 8-Core Processor
    CPU family:           23
    Model:                113
    Thread(s) per core:   2
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             0
    Frequency boost:      enabled
    CPU(s) scaling MHz:   77%
    CPU max MHz:          4935.9370
    CPU min MHz:          2200.0000
    BogoMIPS:             7202.09
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
                          mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc
                          rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor
                          ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm
                          cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
                          skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3
                          cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a
                          rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc
                          cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv
                          svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
                          pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor
                          smca sev sev_es
Virtualization features:
  Virtualization:         AMD-V
Caches (sum of all):
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     4 MiB (8 instances)
  L3:                     32 MiB (2 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-15
Vulnerabilities:
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```

* Operating System, e.g. for Linux: `$ uname -a`

  `Linux johannes-pc 6.3.0-1-MANJARO #1 SMP PREEMPT_DYNAMIC Mon Apr 3 10:46:56 UTC 2023 x86_64 GNU/Linux`

* SDK version, e.g. for Linux:

```
Python 3.10.10
GNU Make 4.4.1
g++ (GCC) 12.2.1 20230201
```

Failure Information (for bugs)

I suspect that there is a race condition somewhere that affects the generated text, and depending on how the race resolves, one of several outputs is produced. The bug only occurs when compiling with LLAMA_CUBLAS=1 and only with a sufficiently long prompt (the navy seals copypasta, 399 tokens); it does not occur with a short prompt ("People die when they are killed.", 8 tokens). Neither the number of threads nor the quantization scheme matters.

Steps to Reproduce

  1. `make clean && LLAMA_CUBLAS=1 make`
  2. `./main --model models/llama-33b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt`, where the file navy_seals_copypasta.txt contains the navy seals copypasta as the prompt (399 tokens).
  3. Repeat step 2 and observe that every time one of several generations appears.

Failure Logs

Below is my console log from repeatedly running with the same seed and parameters. The generated outputs were, in order:

  1. Labels: 4chan, epic win, fail, fun
  2. Labels: 4chan, epic win, fail, fun
  3. (thing) by Kalkin Tue Jul 10
  4. You think this is abuse? This is how I treat people who
``` /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [11:32] > ./main --model models/llama-33b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-33b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 6656 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 52 llama_model_load_internal: n_layer = 60 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 17920 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 30B llama_model_load_internal: ggml ctx size = 127.27 KB llama_model_load_internal: mem required = 21695.48 MB (+ 3124.00 MB per state) llama_init_from_file: kv self size = 3120.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Labels: 4chan, epic win, fail, fun llama_print_timings: load time = 19322.96 ms nyllama_print_timings: sample time = 9.39 ms / 16 runs ( 0.59 ms per run) llama_print_timings: prompt eval time = 17365.60 ms / 399 tokens ( 43.52 ms per token) llama_print_timings: eval time = 7815.47 ms / 15 runs ( 521.03 ms per run) llama_print_timings: total time = 27151.10 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [11:33] > ./main --model models/llama-33b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-33b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 6656 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 52 llama_model_load_internal: n_layer = 60 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 17920 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 30B llama_model_load_internal: ggml ctx size = 127.27 KB llama_model_load_internal: mem required = 21695.48 MB (+ 3124.00 MB per state) llama_init_from_file: kv self size = 3120.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Labels: 4chan, epic win, fail, fun nyllama_print_timings: load time = 19352.40 ms llama_print_timings: sample time = 9.50 ms / 16 runs ( 0.59 ms per run) llama_print_timings: prompt eval time = 17379.04 ms / 399 tokens ( 43.56 ms per token) llama_print_timings: eval time = 7831.54 ms / 15 runs ( 522.10 ms per run) llama_print_timings: total time = 27196.73 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [11:33] > ./main --model models/llama-33b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-33b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 6656 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 52 llama_model_load_internal: n_layer = 60 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 17920 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 30B llama_model_load_internal: ggml ctx size = 127.27 KB llama_model_load_internal: mem required = 21695.48 MB (+ 3124.00 MB per state) llama_init_from_file: kv self size = 3120.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
(thing) by Kalkin Tue Jul 10 2llama_print_timings: load time = 19449.27 ms llama_print_timings: sample time = 9.53 ms / 16 runs ( 0.60 ms per run) llama_print_timings: prompt eval time = 17486.82 ms / 399 tokens ( 43.83 ms per token) llama_print_timings: eval time = 7820.27 ms / 15 runs ( 521.35 ms per run) llama_print_timings: total time = 27282.36 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [11:34] > ./main --model models/llama-33b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-33b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 6656 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 52 llama_model_load_internal: n_layer = 60 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 17920 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 30B llama_model_load_internal: ggml ctx size = 127.27 KB llama_model_load_internal: mem required = 21695.48 MB (+ 3124.00 MB per state) llama_init_from_file: kv self size = 3120.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. You think this is abuse? 
This is how I treat people who sayllama_print_timings: load time = 19359.57 ms llama_print_timings: sample time = 9.34 ms / 16 runs ( 0.58 ms per run) llama_print_timings: prompt eval time = 17398.35 ms / 399 tokens ( 43.60 ms per token) llama_print_timings: eval time = 7865.56 ms / 15 runs ( 524.37 ms per run) llama_print_timings: total time = 27237.87 ms ```
JohannesGaessler commented 1 year ago

I accidentally opened this issue prematurely by pressing CTRL+Enter. I am not yet done with ensuring that everything is correct.

JohannesGaessler commented 1 year ago

Everything should be in order now; sorry for the inconvenience.

slaren commented 1 year ago

Have you noticed if this also happens with smaller models (7B, 13B)?

JohannesGaessler commented 1 year ago

The bug also occurs with 13b:

  1. Dammit, I can't believe I read that whole thing
  2. Dammit, I can't even get angry at this stupid
  3. Dammit, I can't even get angry at this stupid
  4. Dammit, I am so tired of having to deal with people
``` /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:24] > ./main --model models/llama-13b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-13b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 85.08 KB llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 1600.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Dammit, I can't believe I read that whole thing .llama_print_timings: load time = 8611.73 ms llama_print_timings: sample time = 9.20 ms / 16 runs ( 0.58 ms per run) llama_print_timings: prompt eval time = 7279.80 ms / 399 tokens ( 18.25 ms per token) llama_print_timings: eval time = 3326.62 ms / 15 runs ( 221.77 ms per run) llama_print_timings: total time = 11950.80 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:24] > ./main --model models/llama-13b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-13b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 85.08 KB llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 1600.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Dammit, I can't even get angry at this stupid llama_print_timings: load time = 8568.93 ms thingllama_print_timings: sample time = 9.53 ms / 16 runs ( 0.60 ms per run) llama_print_timings: prompt eval time = 7225.52 ms / 399 tokens ( 18.11 ms per token) llama_print_timings: eval time = 3324.20 ms / 15 runs ( 221.61 ms per run) llama_print_timings: total time = 11905.96 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:24] > ./main --model models/llama-13b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-13b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 85.08 KB llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 1600.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Dammit, I can't even get angry at this stupid thingllama_print_timings: load time = 8539.68 ms llama_print_timings: sample time = 9.22 ms / 16 runs ( 0.58 ms per run) llama_print_timings: prompt eval time = 7201.82 ms / 399 tokens ( 18.05 ms per token) llama_print_timings: eval time = 3341.63 ms / 15 runs ( 222.78 ms per run) llama_print_timings: total time = 11893.80 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:25] > ./main --model models/llama-13b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-13b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 85.08 KB llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 1600.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Dammit, I am so tired of having to deal with people likellama_print_timings: load time = 8554.82 ms llama_print_timings: sample time = 9.49 ms / 16 runs ( 0.59 ms per run) llama_print_timings: prompt eval time = 7198.63 ms / 399 tokens ( 18.04 ms per token) llama_print_timings: eval time = 3331.95 ms / 15 runs ( 222.13 ms per run) llama_print_timings: total time = 11899.47 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:25] > ```
slaren commented 1 year ago

I think there are two possible causes for this:

* cuBLAS itself does not guarantee bit-for-bit reproducible results between runs.
* The mat muls are spread over multiple CUDA streams, so the order in which the partial results are computed can differ between runs.

Adding a cudaDeviceSynchronize() at the end of the inner loop of the ggml_cuda_mul_mat_<f32/f16/q_f32> functions, forcing each mat mul to be performed sequentially, may fix this at the cost of performance. Using a different cuBLAS handle per stream may also work, but can also affect performance.
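As a standalone illustration of these two options (a minimal sketch, not the actual ggml-cuda code: the matrix size, stream count, and loop structure are made up, and only the CUDA/cuBLAS calls themselves are real API), a multi-stream GEMM loop could look like this:

```cpp
// determinism_sketch.cu -- standalone sketch of the two options above.
// Build (assumption): nvcc determinism_sketch.cu -lcublas -o determinism_sketch
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n_streams = 8;   // stands in for GGML_CUDA_MAX_STREAMS
    const int n         = 512; // arbitrary matrix dimension

    std::vector<cudaStream_t>   streams(n_streams);
    std::vector<cublasHandle_t> handles(n_streams);
    std::vector<float *>        C(n_streams);
    float *A, *B;
    cudaMalloc((void **) &A, n * n * sizeof(float));
    cudaMalloc((void **) &B, n * n * sizeof(float));
    cudaMemset(A, 0, n * n * sizeof(float));
    cudaMemset(B, 0, n * n * sizeof(float));

    for (int i = 0; i < n_streams; ++i) {
        cudaStreamCreate(&streams[i]);
        cublasCreate(&handles[i]);               // option 2: one cuBLAS handle per stream
        cublasSetStream(handles[i], streams[i]); // instead of sharing a single handle
        cudaMalloc((void **) &C[i], n * n * sizeof(float));
    }

    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < 64; ++i) {
        const int s = i % n_streams;             // round-robin over the streams
        cublasSgemm(handles[s], CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, A, n, B, n, &beta, C[s], n);
        // option 1: wait for all streams before issuing the next mat mul, so the
        // mat muls run in a fixed sequential order (deterministic, but no overlap)
        cudaDeviceSynchronize();
    }

    for (int i = 0; i < n_streams; ++i) {
        cublasDestroy(handles[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(C[i]);
    }
    cudaFree(A);
    cudaFree(B);
    return 0;
}
```

With the cudaDeviceSynchronize() inside the loop, each mat mul finishes before the next one is issued, so the result no longer depends on how the streams happen to interleave; removing it restores the overlap between streams.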

Unless there is a bug with the multi-stream synchronization, or this affects the generation quality, I am not sure that we should do anything about it. Note that the generation quality needs to be evaluated in an objective way, such as with perplexity.

JohannesGaessler commented 1 year ago

I can confirm that the bug also occurs at 7b:

  1. Labels: Al-Quaeda, Armed Forces, Gor
  2. "Dead" as in "deceased"? Wow
  3. Labels: Angry, Dumbasses, Funny,
  4. Labels: Angry, Dumbasses, Funny,
``` /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:51] > ./main --model models/llama-7b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-7b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 68.20 KB llama_model_load_internal: mem required = 5809.33 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 1024.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Labels: Al-Quaeda, Armed Forces, Gor illallama_print_timings: load time = 4975.57 ms llama_print_timings: sample time = 9.52 ms / 16 runs ( 0.60 ms per run) llama_print_timings: prompt eval time = 3880.70 ms / 399 tokens ( 9.73 ms per token) llama_print_timings: eval time = 1799.05 ms / 15 runs ( 119.94 ms per run) llama_print_timings: total time = 6787.42 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:52] > ./main --model models/llama-7b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-7b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 68.20 KB llama_model_load_internal: mem required = 5809.33 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 1024.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. "Dead" as in "deceased"? 
Wow .llama_print_timings: load time = 4991.70 ms llama_print_timings: sample time = 9.42 ms / 16 runs ( 0.59 ms per run) llama_print_timings: prompt eval time = 3880.76 ms / 399 tokens ( 9.73 ms per token) llama_print_timings: eval time = 1808.76 ms / 15 runs ( 120.58 ms per run) llama_print_timings: total time = 6813.24 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:52] > ./main --model models/llama-7b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-7b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 68.20 KB llama_model_load_internal: mem required = 5809.33 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 1024.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Labels: Angry, Dumbasses, Funny, Mllama_print_timings: load time = 4977.78 ms llama_print_timings: sample time = 9.60 ms / 16 runs ( 0.60 ms per run) llama_print_timings: prompt eval time = 3883.38 ms / 399 tokens ( 9.73 ms per token) llama_print_timings: eval time = 1797.12 ms / 15 runs ( 119.81 ms per run) llama_print_timings: total time = 6787.80 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:52] > ./main --model models/llama-7b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt | tee chat.txt main: build = 514 (173d0e6) main: seed = 1337 llama.cpp: loading model from models/llama-7b-ggml-q4_0.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 68.20 KB llama_model_load_internal: mem required = 5809.33 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 1024.00 MB system_info: n_threads = 6 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 2048, n_batch = 512, n_predict = 16, n_keep = 0 What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. 
Labels: Angry, Dumbasses, Funny, Mllama_print_timings: load time = 5017.26 ms llama_print_timings: sample time = 9.44 ms / 16 runs ( 0.59 ms per run) llama_print_timings: prompt eval time = 3915.41 ms / 399 tokens ( 9.81 ms per token) llama_print_timings: eval time = 1795.88 ms / 15 runs ( 119.73 ms per run) llama_print_timings: total time = 6825.87 ms /home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [15:52] > ```

I did not do any objective measurement of generation quality; subjectively, I could not tell a difference. In any case, if cuBLAS does not guarantee reproducibility anyway, then that is probably the reason. I was simply confused because this behavior made me question whether I had accidentally introduced race conditions in https://github.com/ggerganov/llama.cpp/pull/1341 ; perhaps a warning should be printed when the user specifies a seed in combination with cuBLAS (a sketch follows below)? In any case, I agree that this is probably not worth sacrificing performance for.
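A minimal sketch of what such a warning could look like (hypothetical, not code from the repository; `seed` stands in for the command-line seed, and GGML_USE_CUBLAS is assumed to be the define set when building with LLAMA_CUBLAS=1):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: warn that a fixed seed does not guarantee identical
// output when the cuBLAS backend is enabled.
static void warn_if_seed_may_not_reproduce(int32_t seed) {
#ifdef GGML_USE_CUBLAS
    if (seed >= 0) {
        fprintf(stderr,
                "warning: a fixed seed does not guarantee reproducible output "
                "when built with cuBLAS\n");
    }
#else
    (void) seed; // nothing to warn about in CPU-only builds
#endif
}
```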

JohannesGaessler commented 1 year ago

Adding cudaDeviceSynchronize() in the loop does not make a difference. When I set GGML_CUDA_MAX_STREAMS to 1 the outputs become deterministic (see the sketch below); in turn, prompt processing seems to become ~1-2% slower. I think it's sufficient to document this behavior somewhere.
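For reference, the change I tested, assuming GGML_CUDA_MAX_STREAMS is the compile-time constant in ggml-cuda.cu that controls how many CUDA streams the mat muls are spread over (the surrounding code and default value at this commit may differ):

```cpp
// ggml-cuda.cu (sketch): use a single stream so the mat muls run back-to-back
// in a fixed order, which makes the generated output deterministic.
#define GGML_CUDA_MAX_STREAMS 1
```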

JohannesGaessler commented 1 year ago

I just ran perplexity tests for 8 CUDA streams vs. 1 stream. The perplexity of 7b q4_0 was 6.2838 for both configurations. 8 streams was 6% faster than 1 stream with 8.66 ms / token vs. 9.20 ms / token.