containers / podman-desktop-extension-ai-lab

Work with LLMs on a local environment using containers
https://podman-desktop.io/extensions/ai-lab
Apache License 2.0

Expose model server metrics in model playground #438

Open MichaelClifford opened 8 months ago

MichaelClifford commented 8 months ago

Each time an LLM responds, it also outputs some info about its performance.

llama_print_timings:        load time =    4732.44 ms
llama_print_timings:      sample time =      86.82 ms /   485 runs   (    0.18 ms per token,  5586.14 tokens per second)
llama_print_timings: prompt eval time =    1997.60 ms /     2 tokens (  998.80 ms per token,     1.00 tokens per second)
llama_print_timings:        eval time =   20404.39 ms /   484 runs   (   42.16 ms per token,    23.72 tokens per second)
llama_print_timings:       total time =   22575.28 ms /   486 tokens

It would be great if a subset of this information could be exposed to the user on the playground page. Adding the prompt eval and eval tokens-per-second values would give the user a good sense of how the model is performing on their machine.
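For illustration, here is a minimal TypeScript sketch of how those tokens-per-second figures could be pulled out of the llama_print_timings lines, assuming the playground had access to the model server's log output; the function name, interface, and regex are hypothetical, not existing AI Lab code:

```typescript
// Hypothetical helper: extract tokens-per-second values from llama.cpp's
// llama_print_timings output (format shown above).
interface PhaseTimings {
  promptEvalTokensPerSecond?: number;
  evalTokensPerSecond?: number;
}

export function parseLlamaTimings(log: string): PhaseTimings {
  const timings: PhaseTimings = {};
  // Matches e.g. "llama_print_timings: prompt eval time = ... ms per token, 1.00 tokens per second)"
  const pattern = /llama_print_timings:\s+(prompt eval|eval) time\s*=.*?,\s*([\d.]+) tokens per second/g;
  for (const match of log.matchAll(pattern)) {
    const tokensPerSecond = parseFloat(match[2]);
    if (match[1] === 'prompt eval') {
      timings.promptEvalTokensPerSecond = tokensPerSecond;
    } else {
      timings.evalTokensPerSecond = tokensPerSecond;
    }
  }
  return timings;
}
```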

jeffmaury commented 8 months ago

Such metrics are not exposed in a standard manner. We could expose the number of tokens processed (prompt and response) in addition to the elapsed time that is already displayed.

MichaelClifford commented 8 months ago

OK, number of tokens processed would be good too. The issue with only providing total time is that it depends on the number of tokens generated: a response that is twice as long takes roughly twice as much time, which tells you little about the performance of your model server.

If you could display total tokens processed / total time, that would be good. However, as you can see from the output above, different phases of the inference process run at different rates, so this would only be an estimate.

What would need to happen to get these values exposed in a standard manner?
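To make that estimate concrete with the sample output above: (2 + 484) tokens over a 22.58 s total comes out to roughly 21.5 tokens per second, which sits between the 1.00 tokens/s of the prompt eval phase and the 23.72 tokens/s of the eval phase, so the blended figure slightly understates generation speed and hides prompt-processing speed entirely.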

jeffmaury commented 8 months ago

For the token counts, we can get them from the JSON payload. For the extra information, we need to find a way to get it that is not llama.cpp specific.
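As a rough sketch of that direction, assuming the model server exposes an OpenAI-compatible /v1/chat/completions endpoint that returns a usage block on non-streamed responses (the endpoint path, wiring, and return shape here are illustrative, not AI Lab code):

```typescript
// Illustrative sketch: read the token counts from the JSON payload of an
// OpenAI-compatible completion response and combine them with the elapsed
// time the playground already measures.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

async function completeWithStats(baseUrl: string, prompt: string) {
  const started = Date.now();
  const response = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
      stream: false, // assumed: usage is returned with the final, non-streamed payload
    }),
  });
  const payload = await response.json();
  const elapsedSeconds = (Date.now() - started) / 1000;

  const usage: Usage | undefined = payload.usage;
  return {
    text: payload.choices?.[0]?.message?.content ?? '',
    promptTokens: usage?.prompt_tokens,
    completionTokens: usage?.completion_tokens,
    // Blended estimate discussed above: total tokens over total wall-clock time.
    tokensPerSecond: usage ? usage.total_tokens / elapsedSeconds : undefined,
  };
}
```

Because only the token counts and the wall-clock time are used, the same calculation would work for any backend that reports a usage block, which keeps the metric from being llama.cpp specific.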

nichjones1 commented 3 months ago

Postponed, as we need this feature in the llama.cpp code base.

jeffmaury commented 3 months ago

Upstream PR: https://github.com/abetlen/llama-cpp-python/pull/1552