LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Slow Output Generation and Stalling in Kobold CPP when Console Window is Minimized or Occluded by Web Browser #187

Closed Khrspk closed 1 year ago

Khrspk commented 1 year ago

Expected Behavior

Generate outputs at the same speed as when the console window is open.

Current Behavior

The generation will be very slow and often will just stop until you open the console window again.

Environment and Context

Windows 11, RTX 3070 Ti, 32 GB RAM, 12th Gen Intel(R) Core(TM) i7-12700H, 2300 MHz

Failure Information (for bugs)

When using Kobold CPP, output generation becomes significantly slower and often stops altogether when the console window is minimized or occluded, for example by clicking on a web browser window. A generation can take more than 10 minutes (when the window is not minimized it takes only a few seconds), and it often comes to a complete halt until the console window is clicked again. (This doesn't happen when using regular Kobold, so maybe it can be attributed to the way the compiled Kobold CPP program interacts with the operating system and how it handles background processes.)

Steps to Reproduce

  1. Send a prompt
  2. Leave the console window open and observe the output generation time.
  3. Send another prompt
  4. Minimize the console window or click on a web browser window to occlude it.
  5. Observe the drastic increase in output generation time
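
To put numbers on steps 2 and 5, the request can be timed from a script instead of by eye. This is only a rough sketch: it assumes KoboldCpp is serving its KoboldAI-compatible API at `http://localhost:5001/api/v1/generate` and that the reply has the shape `{"results": [{"text": ...}]}`. The names `API_URL`, `post_json` and `time_generation` are made up for illustration.

```python
import json
import time
import urllib.request

# Assumption: KoboldCpp is listening on its usual local port with the
# KoboldAI-compatible generate endpoint. Adjust if your setup differs.
API_URL = "http://localhost:5001/api/v1/generate"


def post_json(url, payload):
    """Send payload as JSON and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def time_generation(prompt, max_length=80, send=post_json, clock=time.perf_counter):
    """Return (seconds_elapsed, generated_text) for one generation request.

    `send` and `clock` are injectable so the function can be exercised
    without a running server.
    """
    start = clock()
    reply = send(API_URL, {"prompt": prompt, "max_length": max_length})
    elapsed = clock() - start
    # Assumed reply shape: {"results": [{"text": "..."}]}
    return elapsed, reply["results"][0]["text"]
```

Calling `time_generation("Once upon a time")` once with the console visible and once with it minimized should give two durations that can be compared directly.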

simulanics commented 1 year ago

For Windows: if you want to see an even more dramatic increase in speed, use a headless console/shell (meaning one that does not use powershell.exe or cmd to initiate the execution of the binary). Also, eliminating the use of any web browser greatly increases speed, since Chrome/Edge/Mozilla/etc. automatically reserve memory and GPU VRAM, leaving fewer resources for text generation and prompt processing.

This is the basis of one of the projects I'm working on that utilizes Kobold. It's a compiled piece of software that uses native system controls and accesses the API after starting Kobold in a headless console. Speed nearly doubles using this method. If you have a GPU and can use the --gpulayers flag, you'll notice even more improvement in speed.

I haven't used "regular kobold" and am not familiar with it, but I bet it's an application rather than a web browser page, and it may even use the same method I'm using currently. See my other threads for details. If this is the case, it is not a bug but rather how operating systems and memory allocation/resources work. On one of our systems with half your resources, we're getting text generation at about 50-150 ms/token with a 13B model, so something definitely doesn't seem right with the 10-minute generation you're experiencing.

What OS do you use? How do you start Kobold?

If you're dropping the model on the exe, open a command prompt instead and try:

koboldcpp.exe --model YourModel.bin --threads 4 --stream --highpriority --smartcontext --blasbatchsize 1024 --blasthreads 4 --useclblast 0 0 --gpulayers 8

**This command seems to run beautifully on our oldest AMD system (made 2008). On our newer systems, some values are increased.

**Depending upon your system resources/setup, you may need to change the last 4 parameters of the command I provided above - if these parameters are wrong, koboldcpp will hard crash immediately without warning. If the gpu layers are wrong, it may even wait to crash until a text generation is requested, as it runs out of VRAM to proceed.
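
For anyone wondering what "starting Kobold in a headless console" might look like in practice, here is one possible reading of it as a Python sketch. It is not the project mentioned above, just an illustration: it builds the same command line as in this comment and starts it on Windows with the `CREATE_NO_WINDOW` creation flag so that no console window is created at all. The helper names `build_command` and `launch_without_console` are hypothetical, and the defaults are the example values from the command above, not recommendations.

```python
import subprocess
import sys


def build_command(exe="koboldcpp.exe", model="YourModel.bin", threads=4, gpu_layers=8):
    """Build the argument list for the command line quoted in this comment."""
    return [
        exe,
        "--model", model,
        "--threads", str(threads),
        "--stream",
        "--highpriority",
        "--smartcontext",
        "--blasbatchsize", "1024",
        "--blasthreads", str(threads),
        "--useclblast", "0", "0",
        "--gpulayers", str(gpu_layers),
    ]


def launch_without_console(command):
    """Start the process with no console window (Windows only)."""
    if sys.platform != "win32":
        raise OSError("CREATE_NO_WINDOW is only meaningful on Windows")
    # CREATE_NO_WINDOW asks Windows not to create a console for the child,
    # so there is no window to minimize or cover in the first place.
    return subprocess.Popen(command, creationflags=subprocess.CREATE_NO_WINDOW)
```

With no console there is also nowhere to read the startup log, so in a real setup the output would probably need to be redirected to a file to see why a bad `--useclblast` or `--gpulayers` value crashed it.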

LostRuins commented 1 year ago

Hmm I have heard of this issue before. It seems to be caused by the circle animation. Do you have an adblocker? If yes, can you try to block the spinning circle element and see if your speeds improve?

Khrspk commented 1 year ago

Clarifying in response to simulanics' comment: I'm running on Windows 11, and because of that I at first thought the problem was related to the fact that in Windows 11, when a process owning a window is completely obscured or minimized, the operating system may disregard its requests for timer resolution, providing no guarantee of a resolution higher than the system default. That would explain why it only happened when the window was minimized or overlapped by something else. Although that could affect performance through overall system performance and resource allocation, AI processing typically relies on numerical computation rather than time-based events (as far as my understanding goes), so I'm really not sure about it.

Disabling the rotating circle didn't seem to fix it. However, running from a command line with koboldcpp.exe --model model.bin --threads 4 --stream --highpriority --smartcontext --blasbatchsize 1024 --blasthreads 4 --useclblast 0 0 --gpulayers 8 seemed to fix the problem, and now generation does not slow down or stop if the console window is minimized or overlaid.

LostRuins commented 1 year ago

Okay, so to confirm the earlier issue: the slowdown was not caused by the browser, but rather by the console itself being minimized? Or does it only lag when the browser is open?

Khrspk commented 1 year ago

Rather by the console itself being minimized or overlapped by any other window. If this is indeed caused by the Windows 11 behavior I mentioned (which I'm really not sure about), it can be fixed by disabling PROCESS_POWER_THROTTLING_IGNORE_TIMER_RESOLUTION for koboldcpp.exe while it's running, but that would probably require building against the Windows 11 SDK. I'm not sure it's really needed, though, since the command line already seems to fix it.
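
In case it helps, this is roughly what that call could look like, sketched in Python with ctypes rather than in koboldcpp's own code. It rests on several assumptions to double-check against the Windows headers: that `ProcessPowerThrottling` is value 4 in `PROCESS_INFORMATION_CLASS`, that the flag is `0x4`, and that setting the flag in `ControlMask` while leaving `StateMask` at 0 asks Windows to always honor the process's timer resolution requests. It also only affects the process that makes the call, so for koboldcpp.exe the call would have to happen inside that process (or on a handle opened with permission to set process information).

```python
import ctypes
import sys

# Values as I understand them from the Windows SDK headers (assumptions to verify).
PROCESS_POWER_THROTTLING_CURRENT_VERSION = 1
PROCESS_POWER_THROTTLING_IGNORE_TIMER_RESOLUTION = 0x4
PROCESS_POWER_THROTTLING_CLASS = 4  # ProcessPowerThrottling


class ProcessPowerThrottlingState(ctypes.Structure):
    """Mirror of PROCESS_POWER_THROTTLING_STATE (three 32-bit ULONGs)."""
    _fields_ = [
        ("Version", ctypes.c_uint32),
        ("ControlMask", ctypes.c_uint32),
        ("StateMask", ctypes.c_uint32),
    ]


def honor_timer_resolution_state():
    """Build a state that takes control of the flag and switches it off."""
    return ProcessPowerThrottlingState(
        Version=PROCESS_POWER_THROTTLING_CURRENT_VERSION,
        ControlMask=PROCESS_POWER_THROTTLING_IGNORE_TIMER_RESOLUTION,
        StateMask=0,
    )


def apply_to_current_process():
    """Apply the state to the calling process. Returns False off Windows or on failure."""
    if sys.platform != "win32":
        return False
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    kernel32.SetProcessInformation.argtypes = [
        ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p, ctypes.c_uint32,
    ]
    kernel32.SetProcessInformation.restype = ctypes.c_int
    state = honor_timer_resolution_state()
    ok = kernel32.SetProcessInformation(
        kernel32.GetCurrentProcess(),
        PROCESS_POWER_THROTTLING_CLASS,
        ctypes.byref(state),
        ctypes.sizeof(state),
    )
    return bool(ok)
```

On Windows versions that don't know this flag the call would be expected to fail, in which case `apply_to_current_process()` simply returns False.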

gustrd commented 1 year ago

I experienced an issue much like the one you've described, and after some investigation I realized that the root cause was related to the Windows Power Settings. More specifically, when in 'Balanced' mode, which is the default, it becomes crucial to select the "Power Mode: Best Performance" option.

If you don't do so, you might notice a substantial slowdown in the background consoles. Thus, I'd recommend checking and adjusting your power settings for potentially improved performance.

Khrspk commented 1 year ago

In fact my power settings were already set to best performance so I don't think that was the cause for me.

Khrspk commented 1 year ago

Update: although the command line "soft fixed" it, I realized the generation time was still a bit slower when the console was overlapped by any window, so I checked all the Windows settings and discovered it was Windows Game Mode limiting its resources when it was running in the background. I think disabling Game Mode is the real fix.

LostRuins commented 1 year ago

Cool. I have game mode always disabled so I never notice such stuff haha. But hopefully it is useful to others.