Since Ver0.3.0, the "prompt eval time" and "eval time" values in the timing output (llama_print_timings / llama_perf_context_print) are displayed as 0.00 ms.
I suspected a problem in llama.cpp itself, but the latest master of llama.cpp prints these timings correctly.
Here are the code and the results.
from llama_cpp import Llama
model = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",
)
output = model(
    prompt="Q: Name the planets in the solar system? A: ",
    max_tokens=128,  # Generate up to 128 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True,  # Echo the prompt back in the output
)
print(output)
Environment:
Ubuntu 22.04
Python 3.10.12
llama-cpp-python Ver0.2.90
llama_print_timings: sample time = 2.79 ms / 33 runs ( 0.08 ms per token, 11819.48 tokens per second)
llama_print_timings: prompt eval time = 14805.69 ms / 13 tokens ( 1138.90 ms per token, 0.88 tokens per second)
llama_print_timings: eval time = 3430.58 ms / 32 runs ( 107.21 ms per token, 9.33 tokens per second)
llama_print_timings: total time = 18278.73 ms / 45 tokens
llama-cpp-python Ver0.3.0
llama_perf_context_print: load time = 14788.07 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 13 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 48 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 20017.62 ms / 61 tokens
llama-cpp-python Ver0.3.1
llama_perf_context_print: load time = 14937.34 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 13 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 48 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 20313.54 ms / 61 tokens
llama.cpp (latest master)
exec command: ./llama-cli -m Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf -p "Q: Name the planets in the solar system? A: " -n 400 -e
llama_perf_context_print: load time = 1450.48 ms
llama_perf_context_print: prompt eval time = 600.72 ms / 13 tokens ( 46.21 ms per token, 21.64 tokens per second)
llama_perf_context_print: eval time = 42424.41 ms / 399 runs ( 106.33 ms per token, 9.40 tokens per second)
llama_perf_context_print: total time = 43197.46 ms / 412 tokens
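For reference, here is a minimal sketch (assuming only the standard time module and the same high-level Llama call as above) that measures wall-clock generation time from Python; it only shows that generation still takes real time even though the printed per-phase timings read 0.00 ms in Ver0.3.0/0.3.1, and is not a fix for the timing output itself:

import time

from llama_cpp import Llama

model = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",
)

# Time the call ourselves, since the library's printed timings show 0.00 ms in 0.3.x.
start = time.perf_counter()
output = model(
    prompt="Q: Name the planets in the solar system? A: ",
    max_tokens=128,
    stop=["Q:", "\n"],
    echo=True,
)
elapsed_ms = (time.perf_counter() - start) * 1000

# The returned completion dict uses the OpenAI-style schema, including a "usage" section.
n_tokens = output["usage"]["completion_tokens"]
print(f"wall-clock generation time: {elapsed_ms:.2f} ms for {n_tokens} completion tokens")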