Other models are also affected, but not by as much.
./llama-cli -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "<|start_header_id|>user<|end_header_id|>\n\nWrite a story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -s 0
llama_print_timings: load time = 1279.15 ms
llama_print_timings: sample time = 10.33 ms / 256 runs ( 0.04 ms per token, 24782.19 tokens per second)
llama_print_timings: prompt eval time = 71.23 ms / 14 tokens ( 5.09 ms per token, 196.56 tokens per second)
llama_print_timings: eval time = 6348.85 ms / 255 runs ( 24.90 ms per token, 40.16 tokens per second)
llama_print_timings: total time = 6451.53 ms / 269 tokens
./llama-cli -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "<|start_header_id|>user<|end_header_id|>\n\nWrite a story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -s 0 --top-k 0
llama_print_timings: load time = 1767.77 ms
llama_print_timings: sample time = 1116.17 ms / 256 runs ( 4.36 ms per token, 229.36 tokens per second)
llama_print_timings: prompt eval time = 70.73 ms / 14 tokens ( 5.05 ms per token, 197.95 tokens per second)
llama_print_timings: eval time = 6539.61 ms / 255 runs ( 25.65 ms per token, 38.99 tokens per second)
llama_print_timings: total time = 7747.59 ms / 269 tokens
./llama-cli -m Mistral-Nemo-Instruct-2407-Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "[INST]Write a story.[/INST]" -s 0
llama_print_timings: load time = 1986.72 ms
llama_print_timings: sample time = 11.89 ms / 256 runs ( 0.05 ms per token, 21534.32 tokens per second)
llama_print_timings: prompt eval time = 108.87 ms / 7 tokens ( 15.55 ms per token, 64.30 tokens per second)
llama_print_timings: eval time = 9471.36 ms / 255 runs ( 37.14 ms per token, 26.92 tokens per second)
llama_print_timings: total time = 9614.94 ms / 262 tokens
./llama-cli -m Mistral-Nemo-Instruct-2407-Q8_0.gguf --no-mmap -fa -c 8192 -n 256 -p "[INST]Write a story.[/INST]" -s 0 --top-k 0
llama_print_timings: load time = 1980.47 ms
llama_print_timings: sample time = 1244.84 ms / 256 runs ( 4.86 ms per token, 205.65 tokens per second)
llama_print_timings: prompt eval time = 108.71 ms / 7 tokens ( 15.53 ms per token, 64.39 tokens per second)
llama_print_timings: eval time = 9647.18 ms / 255 runs ( 37.83 ms per token, 26.43 tokens per second)
llama_print_timings: total time = 11023.88 ms / 262 tokens
With K=0, the entire vocab will be sorted, so a slowdown is expected: the larger the vocab, the larger the slowdown.
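The effect can be illustrated with a small standalone sketch. This is not the actual llama.cpp sampler code; the candidate struct, the top_k helper, and the 256k vocab size are assumptions chosen for illustration. The point is that k <= 0 gets treated as "keep the whole vocabulary", so the partial sort degenerates into sorting every candidate on every sampled token.

// Sketch only (not the llama.cpp implementation): why --top-k 0 is slow.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

struct candidate { int id; float logit; };

// Hypothetical helper: keep only the k best candidates by logit.
static void top_k(std::vector<candidate> & cands, int k) {
    if (k <= 0 || k > (int) cands.size()) {
        k = (int) cands.size();      // k = 0 -> keep (and sort) the full vocab
    }
    // partial_sort is roughly O(n log k); with k == n it becomes a full
    // O(n log n) sort of the vocabulary, repeated for every generated token.
    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
                      [](const candidate & a, const candidate & b) {
                          return a.logit > b.logit;
                      });
    cands.resize(k);
}

int main() {
    const int n_vocab = 256000;      // illustrative large vocab (Gemma-sized)
    std::mt19937 rng(0);
    std::normal_distribution<float> dist(0.0f, 1.0f);

    std::vector<candidate> cands(n_vocab);
    for (int i = 0; i < n_vocab; ++i) {
        cands[i] = { i, dist(rng) };
    }

    top_k(cands, 0);                 // --top-k 0: sorts all 256k candidates
    printf("kept %zu candidates, best logit %.3f\n", cands.size(), cands[0].logit);
}

With a typical k such as 40, the same partial sort only maintains a 40-element prefix, which is consistent with the microsecond-level sample times seen in the default runs above.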
What happened?
Sample times are greatly increased with --top-k 0, especially with Gemma models.
Name and Version
version: 3570 (4134999e) built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
What operating system are you seeing the problem on?
Mac
Relevant log output