ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Long sample times with --top-k 0 #8988

Closed · Azirine closed this issue 3 months ago

Azirine commented 3 months ago

What happened?

Sample times are greatly increased with --top-k 0, especially with Gemma models.

Name and Version

version: 3570 (4134999e) built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0

What operating system are you seeing the problem on?

Mac

Relevant log output

./llama-cli -m gemma-2-2b-it-Q8_0.gguf --no-mmap -n 512 -p "<start_of_turn>user\nWrite a story.<end_of_turn>\n<start_of_turn>model\n" -s 0
llama_print_timings:        load time =     586.20 ms
llama_print_timings:      sample time =      38.75 ms /   512 runs   (    0.08 ms per token, 13214.27 tokens per second)
llama_print_timings: prompt eval time =      34.71 ms /    13 tokens (    2.67 ms per token,   374.58 tokens per second)
llama_print_timings:        eval time =    6059.14 ms /   511 runs   (   11.86 ms per token,    84.34 tokens per second)
llama_print_timings:       total time =    6204.28 ms /   524 tokens

./llama-cli -m gemma-2-2b-it-Q8_0.gguf --no-mmap -n 512 -p "<start_of_turn>user\nWrite a story.<end_of_turn>\n<start_of_turn>model\n" -s 0  --top-k 0
llama_print_timings:        load time =     605.89 ms
llama_print_timings:      sample time =    6788.36 ms /   512 runs   (   13.26 ms per token,    75.42 tokens per second)
llama_print_timings: prompt eval time =      34.69 ms /    13 tokens (    2.67 ms per token,   374.77 tokens per second)
llama_print_timings:        eval time =    6463.94 ms /   511 runs   (   12.65 ms per token,    79.05 tokens per second)
llama_print_timings:       total time =   13361.44 ms /   524 tokens

./llama-cli -m gemma-2-9b-it-Q8_0.gguf --no-mmap -n 256 -p "<start_of_turn>user\nWrite a story.<end_of_turn>\n<start_of_turn>model\n" -s 0
llama_print_timings:        load time =    1730.14 ms
llama_print_timings:      sample time =      19.23 ms /   256 runs   (    0.08 ms per token, 13309.07 tokens per second)
llama_print_timings: prompt eval time =      91.49 ms /    13 tokens (    7.04 ms per token,   142.09 tokens per second)
llama_print_timings:        eval time =    8442.00 ms /   255 runs   (   33.11 ms per token,    30.21 tokens per second)
llama_print_timings:       total time =    8588.98 ms /   268 tokens

./llama-cli -m gemma-2-9b-it-Q8_0.gguf --no-mmap -n 256 -p "<start_of_turn>user\nWrite a story.<end_of_turn>\n<start_of_turn>model\n" -s 0  --top-k 0
llama_print_timings:        load time =    1747.26 ms
llama_print_timings:      sample time =    4455.29 ms /   256 runs   (   17.40 ms per token,    57.46 tokens per second)
llama_print_timings: prompt eval time =      91.12 ms /    13 tokens (    7.01 ms per token,   142.67 tokens per second)
llama_print_timings:        eval time =    8631.37 ms /   255 runs   (   33.85 ms per token,    29.54 tokens per second)
llama_print_timings:       total time =   13216.80 ms /   268 tokens
Azirine commented 3 months ago

Other models are also affected, but not as severely.

./llama-cli -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "<|start_header_id|>user<|end_header_id|>\n\nWrite a story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -s 0
llama_print_timings:        load time =    1279.15 ms
llama_print_timings:      sample time =      10.33 ms /   256 runs   (    0.04 ms per token, 24782.19 tokens per second)
llama_print_timings: prompt eval time =      71.23 ms /    14 tokens (    5.09 ms per token,   196.56 tokens per second)
llama_print_timings:        eval time =    6348.85 ms /   255 runs   (   24.90 ms per token,    40.16 tokens per second)
llama_print_timings:       total time =    6451.53 ms /   269 tokens

./llama-cli -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "<|start_header_id|>user<|end_header_id|>\n\nWrite a story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -s 0 --top-k 0
llama_print_timings:        load time =    1767.77 ms
llama_print_timings:      sample time =    1116.17 ms /   256 runs   (    4.36 ms per token,   229.36 tokens per second)
llama_print_timings: prompt eval time =      70.73 ms /    14 tokens (    5.05 ms per token,   197.95 tokens per second)
llama_print_timings:        eval time =    6539.61 ms /   255 runs   (   25.65 ms per token,    38.99 tokens per second)
llama_print_timings:       total time =    7747.59 ms /   269 tokens

./llama-cli -m Mistral-Nemo-Instruct-2407-Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "[INST]Write a story.[/INST]" -s 0
llama_print_timings:        load time =    1986.72 ms
llama_print_timings:      sample time =      11.89 ms /   256 runs   (    0.05 ms per token, 21534.32 tokens per second)
llama_print_timings: prompt eval time =     108.87 ms /     7 tokens (   15.55 ms per token,    64.30 tokens per second)
llama_print_timings:        eval time =    9471.36 ms /   255 runs   (   37.14 ms per token,    26.92 tokens per second)
llama_print_timings:       total time =    9614.94 ms /   262 tokens

./llama-cli -m Mistral-Nemo-Instruct-2407-Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "[INST]Write a story.[/INST]" -s 0 --top-k 0
llama_print_timings:        load time =    1980.47 ms
llama_print_timings:      sample time =    1244.84 ms /   256 runs   (    4.86 ms per token,   205.65 tokens per second)
llama_print_timings: prompt eval time =     108.71 ms /     7 tokens (   15.53 ms per token,    64.39 tokens per second)
llama_print_timings:        eval time =    9647.18 ms /   255 runs   (   37.83 ms per token,    26.43 tokens per second)
llama_print_timings:       total time =   11023.88 ms /   262 tokens
ggerganov commented 3 months ago

With K=0, the entire vocab will be sorted:

https://github.com/ggerganov/llama.cpp/blob/4134999e01f31256b15342b41c4de9e2477c4a6c/src/llama-sampling.cpp#L69-L71

So a slowdown is expected: the larger the vocab, the larger the slowdown.