LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Suggestion: output the "generating" portion in the terminal even if --quiet is set #1115

Open morbidCode opened 1 week ago

morbidCode commented 1 week ago

Hello. Usually, if --quiet is not set, we get this during inference:

generating: 12/512 tokens

but this also outputs the prompts and the responses. On the other hand, if --quiet is set, it silences everything except the stats at the end of the response.

Would it be possible to output the "generating" portion even if --quiet is set? I think this should not consume too many lines in the terminal, since it updates in place rather than creating a new line. The use case is for very slow models: it would be nice to see whether generation is about to finish (e.g. generating: 500/512 tokens), and for non-streaming setups, to see whether it is inferencing at all. Thanks!

LostRuins commented 1 week ago

Generally --quiet's goal is to minimize terminal output, so if this is added in the future it would be within the API only. You can currently query the /api/extra/perf/ endpoint to determine whether a request is in progress, although token information is not available until generation is complete (unless you use polled streaming).
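
For anyone who wants to monitor progress from a script in the meantime, here is a minimal client-side polling sketch in Python. It assumes a local instance on the default port 5001; the JSON field names used below ("idle", "results"/"text") and the /api/extra/generate/check polled-streaming endpoint are assumptions to verify against your build's API docs, not guaranteed behavior.

```python
# Minimal client-side polling sketch (not part of koboldcpp itself).
# Assumes a local instance on the default port 5001. The JSON field names
# ("idle", "results"/"text") are assumptions -- check your version's API docs.
import time
import requests

BASE = "http://localhost:5001"

def watch_generation(poll_seconds=2.0):
    """Poll the perf endpoint to see whether a request is being processed."""
    while True:
        perf = requests.get(f"{BASE}/api/extra/perf/").json()
        # Assumption: "idle" is 1 when no generation is in progress.
        if perf.get("idle", 1) == 1:
            print("server idle, no generation in progress")
            return
        # With polled streaming, the partial text generated so far can be
        # fetched; the assumed response shape is {"results": [{"text": "..."}]}.
        check = requests.get(f"{BASE}/api/extra/generate/check").json()
        partial = check.get("results", [{}])[0].get("text", "")
        print(f"generation in progress, {len(partial)} chars so far")
        time.sleep(poll_seconds)
```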

morbidCode commented 1 week ago

Got it. But will the multiuser flag impact the API call? Suppose I am inferencing with the Kobold UI and then I call /api/extra/perf/. Will koboldcpp classify that as 2 users, since the default value of multiuser is 1?

LostRuins commented 1 week ago

It will be fine