abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

"eval time" and "prompt eval time" is 0.00ms after Ver0.3.0 #1830

Open nai-kon opened 1 week ago

nai-kon commented 1 week ago

Since Ver0.3.0, the "eval time" and "prompt eval time" values from llama_print_timings are displayed as 0.00 ms. I first suspected a problem in llama.cpp itself, but when I ran the latest llama.cpp master directly, the timings were printed correctly.

Here are the code and the results.

from llama_cpp import Llama

model = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",
)
output = model(
    prompt="Q: Name the planets in the solar system? A: ",
    max_tokens=128, # Generate up to 128 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
)
print(output)

llama-cpp-python Ver0.2.90

llama_print_timings:      sample time =       2.79 ms /    33 runs   (    0.08 ms per token, 11819.48 tokens per second)
llama_print_timings: prompt eval time =   14805.69 ms /    13 tokens ( 1138.90 ms per token,     0.88 tokens per second)
llama_print_timings:        eval time =    3430.58 ms /    32 runs   (  107.21 ms per token,     9.33 tokens per second)
llama_print_timings:       total time =   18278.73 ms /    45 tokens

llama-cpp-python Ver0.3.0

llama_perf_context_print:        load time =   14788.07 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    13 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    48 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   20017.62 ms /    61 tokens

llama-cpp-python Ver0.3.1

llama_perf_context_print:        load time =   14937.34 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    13 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    48 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   20313.54 ms /    61 tokens

llama.cpp (latest master)

exec command: ./llama-cli -m Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf -p "Q: Name the planets in the solar system? A: " -n 400 -e

llama_perf_context_print:        load time =    1450.48 ms
llama_perf_context_print: prompt eval time =     600.72 ms /    13 tokens (   46.21 ms per token,    21.64 tokens per second)
llama_perf_context_print:        eval time =   42424.41 ms /   399 runs   (  106.33 ms per token,     9.40 tokens per second)
llama_perf_context_print:       total time =   43197.46 ms /   412 tokens
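
As a rough stopgap until the reported timings are fixed, the generation speed can still be estimated from Python by timing the call with time.perf_counter and reading the token counts from the usage field of the returned completion dict. This is only a wall-clock sketch (prompt eval and eval are lumped together), not a replacement for llama_perf_context_print:

import time

from llama_cpp import Llama

model = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",
)

start = time.perf_counter()
output = model(
    prompt="Q: Name the planets in the solar system? A: ",
    max_tokens=128,
    stop=["Q:", "\n"],
    echo=True,
)
elapsed = time.perf_counter() - start

# The completion dict reports token counts in OpenAI-compatible form.
usage = output["usage"]
print(f"wall-clock time: {elapsed * 1000:.2f} ms / {usage['total_tokens']} tokens")
print(f"~{usage['completion_tokens'] / elapsed:.2f} generated tokens per second (prompt + eval combined)")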