LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

SSE Stream: add wait time input in GUI for the API #1064

Open hentaitaku opened 2 months ago

hentaitaku commented 2 months ago

Hello, can you add an input to the GUI for changing this wait time, so I don't have to rebuild kobold for Windows on each new version? When I translate text with a GGUF model, generation finishes in 300-400 ms, but the 0.25 s wait time is too long; 0.05 s works for me.

https://github.com/LostRuins/koboldcpp/blob/c7108742f43e854b852c357c56cf11b5c637f188/koboldcpp.py#L1426

I also found out that setting my NVIDIA power management mode from "Normal" to "Maximum Performance" makes my generation about 50% faster. https://nvidia.custhelp.com/app/answers/detail/a_id/3130/~/setting-power-management-mode-from-normal-to-maximum-performance

Many Thanks

LostRuins commented 2 months ago

If you're running fast bulk koboldcpp requests, why not use the sync generate endpoint instead of SSE streaming? That will not have the delay. The delay is necessary to prevent the streaming from overtaking the generate function and potentially returning invalid results.
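For illustration, a bulk translation loop against the synchronous endpoint could look like the sketch below. It assumes a default local koboldcpp instance on port 5001 and the KoboldAI-style /api/v1/generate route; the payload fields shown are a minimal subset, and the prompt wording is just an example.

```python
import requests

# Assumes a koboldcpp instance running locally on the default port.
API = "http://localhost:5001/api/v1/generate"

def translate(text):
    payload = {
        "prompt": f"Translate to English:\n{text}\n",
        "max_length": 200,
    }
    r = requests.post(API, json=payload, timeout=60)
    r.raise_for_status()
    # The KoboldAI-style API returns {"results": [{"text": ...}]}
    return r.json()["results"][0]["text"]

for line in ["Bonjour le monde", "Guten Tag"]:
    print(translate(line))
```

Since each request blocks until generation is done, there is no streaming startup delay to tune at all.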

hentaitaku commented 2 months ago

Could we add a check that detects whether generate has started, instead of a fixed wait time? A 0.05 s wait works for me.

Couldn't this work: set a variable before handle.generate() and read it in handle_sse_stream(), so it knows generation has started? Even better would be if handle.generate() could set a flag after the first token is generated. Roughly:

```python
global sse_generate_started
sse_generate_started = True
ret = handle.generate(inputs)
outstr = ""
if ret.status == 1:
```
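A minimal self-contained sketch of that idea, using a threading.Event in place of a bare global; the names and the stand-in generate function are hypothetical, not koboldcpp internals:

```python
import threading
import time

generation_started = threading.Event()  # hypothetical flag, not in koboldcpp today

def fake_generate():
    # Stand-in for handle.generate(inputs): signal as soon as work begins.
    generation_started.set()
    time.sleep(0.3)  # pretend the model needs ~300 ms
    return "translated text"

def fake_sse_stream():
    # Block until generation has started instead of sleeping a fixed 0.25 s;
    # the timeout is a safety net in case generate never runs.
    if generation_started.wait(timeout=1.0):
        print("streaming can begin immediately")

t = threading.Thread(target=fake_sse_stream)
t.start()
fake_generate()
t.join()
```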

LostRuins commented 2 months ago

Generate has always started by then, because that function is always called before the SSE streaming function. The problem is that you can't be sure the old buffer has been cleared in time before you start the streaming thread.
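To make the hazard concrete, here is a toy illustration of that race; the buffer and both functions are invented for the example and differ from koboldcpp's real bookkeeping:

```python
import threading

token_buffer = ["stale", "tokens"]      # leftovers from a previous request

def generate():
    token_buffer.clear()                # old output is cleared only here
    token_buffer.extend(["fresh", "output"])

def stream():
    # If this runs before generate() clears the buffer, it can
    # emit the previous request's tokens to the new client.
    print(token_buffer)

t = threading.Thread(target=stream)
t.start()                               # races with generate() below
generate()
t.join()
```

The fixed 0.25 s delay simply gives generate() enough of a head start that the clear has happened before the streamer first reads.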

Do note that this does not actually slow down the generation speed. Even though the streaming thread starts later, it will catch up with the already generated tokens, and the total speed should be about the same.