Open MichaelClifford opened 8 months ago
Such metrics are not exposed in a standard manner. We could expose the number of tokens processed (prompt and response) in addition to the elapsed time that is already displayed.
OK, the number of tokens processed would be good too. The issue with only providing total time is that it depends on the number of tokens generated: a response that is twice as long takes roughly twice as much time, which tells you little about the performance of your model server.
If you could display total tokens processed / total time, that would be good. However, as you can see from the outputs above, different phases of the inference process run at different speeds, so this single figure would only be an estimate.
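For illustration only, here is a minimal sketch of that single blended figure; the function name and the example numbers are made up for this comment, not taken from the playground code:

```ts
// Hypothetical helper, not from the playground code: one blended
// throughput figure across prompt processing and generation.
function overallTokensPerSecond(
  promptTokens: number,
  completionTokens: number,
  elapsedMs: number,
): number {
  const totalTokens = promptTokens + completionTokens;
  return totalTokens / (elapsedMs / 1000);
}

// Example: 56 prompt tokens + 128 generated tokens over 4.2 s ≈ 43.8 tokens/s.
console.log(overallTokensPerSecond(56, 128, 4200).toFixed(1));
```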
What would need to happen to get these values exposed in a standard manner?
The token counts we can get from the JSON payload. For the extra information we need to find a way to get it that is not llama.cpp specific.
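Assuming the playground talks to an OpenAI-compatible completion endpoint that reports a `usage` block, a non-llama.cpp-specific sketch could look like the following; the interfaces and helper are assumptions, not existing playground code:

```ts
// Assumed shape of an OpenAI-compatible completion response; only the
// fields needed for these metrics are modelled here.
interface CompletionUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface CompletionResponse {
  usage?: CompletionUsage;
}

// Hypothetical helper: combine the usage block with the elapsed time the
// playground already displays into one short metrics line.
function formatMetrics(payload: CompletionResponse, elapsedMs: number): string {
  const seconds = (elapsedMs / 1000).toFixed(1);
  const usage = payload.usage;
  if (!usage) {
    return `${seconds} s`;
  }
  const tps = (usage.total_tokens / (elapsedMs / 1000)).toFixed(1);
  return `${usage.prompt_tokens} prompt + ${usage.completion_tokens} completion tokens in ${seconds} s (~${tps} tokens/s)`;
}
```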
Postponed, as we need the feature in the llama.cpp code base.
Each time an LLM responds, it also outputs some info about its performance.
It would be great if a subset of this information could be exposed to the user on the playground page. Adding the `prompt eval time` tokens per second and the `eval time` tokens per second would give the user a good sense of how the model is performing on their machine.
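A rough sketch of how those two per-phase rates could be computed, assuming the server reports per-phase token counts and durations; the field names below loosely follow llama.cpp's server timings output but are an assumption here, not something this issue confirms:

```ts
// Assumed per-phase timing data; field names loosely follow llama.cpp's
// server "timings" block but are an assumption for this sketch.
interface PhaseTimings {
  prompt_n: number;      // tokens processed in the prompt eval phase
  prompt_ms: number;     // time spent on prompt eval, in milliseconds
  predicted_n: number;   // tokens generated in the eval phase
  predicted_ms: number;  // time spent generating, in milliseconds
}

// Compute the two rates this issue asks to surface on the playground page.
function perPhaseTokensPerSecond(t: PhaseTimings): { promptEvalTps: number; evalTps: number } {
  return {
    promptEvalTps: t.prompt_n / (t.prompt_ms / 1000),
    evalTps: t.predicted_n / (t.predicted_ms / 1000),
  };
}
```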