LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[Feature Request] Horde generation should wait to poll for jobs until idle #742

Closed: strikaco closed this issue 2 months ago

strikaco commented 4 months ago

I'm using a local kobold instance on a beefy server (for work). Most of its time is spent idle. It's an ideal candidate for the Horde.

I've configured it to run Horde jobs, but now it pulls new jobs every second I'm not using it, and when I do need it I'm frequently stuck waiting for Horde jobs to finish. I only use this thing once or twice a week, but when I do, I need its full attention.

It would be great if kobold would wait 5 minutes after the end of the last non-Horde job before it starts aggressively polling the Horde for more work.
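Roughly what I have in mind, as a Python sketch; none of these names come from koboldcpp, and the Horde poll itself is just a placeholder:

```python
import time

IDLE_COOLDOWN_SECONDS = 5 * 60   # hypothetical setting: wait 5 minutes after the last local job
last_local_request = 0.0         # updated whenever a non-Horde job finishes


def horde_poll_loop(poll_horde_job, run_job):
    """Poll the Horde only once the server has been idle long enough."""
    while True:
        idle_for = time.time() - last_local_request
        if idle_for < IDLE_COOLDOWN_SECONDS:
            # A local job ran recently: stay out of the Horde pool and re-check soon.
            time.sleep(5)
            continue
        job = poll_horde_job()   # placeholder for the real Horde poll
        if job is not None:
            run_job(job)
        else:
            time.sleep(1)        # nothing queued; poll again shortly
```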

Great project, thanks for your efforts!

franshst commented 3 months ago

Have you tried giving the Horde job a low priority, using the Linux `nice` command?
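For example, a launcher script could drop its own priority with `os.nice()`, the standard-library counterpart of the shell command (Unix only); this is just an illustration of the idea, not something koboldcpp does today:

```python
import os
import sys

# Drop this process's CPU scheduling priority before loading the model.
# os.nice() mirrors the shell's `nice` command (Unix only); 19 is the lowest priority.
try:
    os.nice(19)
except (AttributeError, OSError):
    print("could not lower process priority on this platform", file=sys.stderr)
```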

strikaco commented 3 months ago

> Have you tried giving the Horde job a low priority, using the Linux `nice` command?

Thanks, but that's a solution for a different issue. I don't want the process itself to have low priority (kobold crunching is all this machine does); my wish is that new Horde jobs would temporarily be blocked while jobs are being passed to the local kobold instance. It's about user prioritization, not process prioritization.

Basically kick it into maintenance mode and put it back on the cloud once no local jobs have come in for x minutes.

henk717 commented 3 months ago

I don't think that fits with the nature of the Horde. The queue would be too long and unreliable. When people host a worker we expect resources to be available; if they need the PC for something else, we expect them to turn it off for the time being.

One job every 5 minutes would hurt its ecosystem more than turning your worker off if you are the only worker for that model.

If you submit a job of your own in multiuser mode, it will get processed next, ahead of Horde jobs.
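Conceptually, the multiuser queue just serves local submissions ahead of anything pulled from the Horde, something like this rough sketch (illustrative only, not the actual koboldcpp code):

```python
import itertools
import queue

LOCAL_PRIORITY = 0   # lower number is served first
HORDE_PRIORITY = 1

_order = itertools.count()        # tie-breaker keeps FIFO order within a priority
job_queue = queue.PriorityQueue()


def submit_local(job):
    job_queue.put((LOCAL_PRIORITY, next(_order), job))


def submit_horde(job):
    job_queue.put((HORDE_PRIORITY, next(_order), job))


def worker_loop(run_job):
    while True:
        _priority, _seq, job = job_queue.get()   # local jobs come out before Horde jobs
        run_job(job)
```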

LostRuins commented 3 months ago

Hmm. What I can consider adding is a short delay (maybe 3-5 seconds) before polling resumes, if recent local usage is detected, allowing you to have immediate priority if you are actively using it.

If you do have very heavy usage, it's better to shut off the worker until you're done. Ideally you shouldn't broadcast a worker that's online but unable to reliably fulfill incoming requests.
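The "recent local usage" check could be as simple as stamping a timestamp on every non-Horde request and having the polling loop consult it. A rough sketch with made-up helper names, not the actual implementation:

```python
import threading
import time

_lock = threading.Lock()
_last_local_use = 0.0


def note_local_use():
    """Call whenever a local (non-Horde) generation request comes in."""
    global _last_local_use
    with _lock:
        _last_local_use = time.time()


def seconds_since_local_use():
    """How long the instance has been idle from the local user's point of view."""
    with _lock:
        return time.time() - _last_local_use
```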

strikaco commented 3 months ago

> When people host a worker we expect resources to be available; if they need the PC for something else, we expect them to turn it off for the time being.

> One job every 5 minutes would hurt its ecosystem more than turning your worker off if you are the only worker for that model.

Understood, but I think you're misreading me. It's not "run one job every 5 minutes," it's "run as many jobs as you can, but treat local jobs as a blocking operation that takes the worker offline until it's been idle for 5 minutes." Like the maintenance mode the Horde threatens workers with for dropping too many jobs, only this would be a voluntary maintenance mode triggered by local jobs.

I do get where you're coming from in trying to guarantee some amount of reliability. Maybe we have different expectations about the SLA of volunteered resources, though? I don't know what sort of disruptive use case you're envisioning (please enlighten me if there's something I'm not seeing), but my own case is that I pay to power this thing and it sits idle 99% of the time. The community is more than welcome to all of its spare cycles, if you guys can accommodate my need to use it uninterrupted for ~15-30 minutes per week. Leaving it up to me to ssh into the box and stop/start services means I'll take it offline and forget to re-add it to the pool, resulting in wasted electricity on my end and one less worker for you.

> Hmm. What I can consider adding is a short delay (maybe 3-5 seconds) before polling resumes, if recent local usage is detected, allowing you to have immediate priority if you are actively using it.

> If you do have very heavy usage, it's better to shut off the worker until you're done. Ideally you shouldn't broadcast a worker that's online but unable to reliably fulfill incoming requests.

Also understood, which is why I suggested a more conservative 5-minute (not seconds) offline cooldown before polling resumes. The sooner I'm done with the handful of operations I need to do, the quicker the server can be rededicated to the pool. I'd expect a series of short delays to be really annoying for everybody involved.

LostRuins commented 3 months ago

If you're able to build from my experimental branch, I have trialed a solution that should fit most use cases: https://github.com/LostRuins/koboldcpp/commit/aa5124439d46a165f9856f96fa2a85d9a49c3300

How it works: when recent local (non-Horde) usage is detected, Horde polling pauses for a short delay (about 20 seconds in this build) before resuming, so local requests get priority while you're actively using the instance.

LostRuins commented 3 months ago

Hello, this has now been fixed in the latest version. Please try it out.

strikaco commented 2 months ago

Hey, this works great. Retry also works much better with this in place, which is a benefit I hadn't even considered.

If it's not too much to ask, is it possible to make the 20-second delay configurable (via CLI, if nothing else)? I see you anticipated highly interactive chat usage, but what I'm doing is copying and pasting CSV segments into instruct mode, and it usually takes me a little more than 20 seconds to retrieve the data for and frame the context of the next input. Even a static 60 seconds (for instruct mode only?) would work.
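For example, a single flag would be plenty on my end; the flag name and default below are made up, purely to illustrate:

```python
import argparse

parser = argparse.ArgumentParser(description="example launcher flags (illustrative only)")
# Hypothetical flag: how long Horde polling stays paused after a local request.
parser.add_argument("--horde-local-delay", type=int, default=20, metavar="SECONDS",
                    help="seconds to pause Horde polling after local usage (hypothetical flag)")
args = parser.parse_args()

local_delay_seconds = args.horde_local_delay
print(f"Horde polling will pause for {local_delay_seconds}s after local requests")
```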

If not, this addresses the request enough to close.

Thanks so much!

LostRuins commented 2 months ago

Hmm, yeah, I think for now this strikes a good balance. If you need it longer, you can try editing the delay value setting in the koboldcpp.py file.
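Something along these lines; the actual variable in koboldcpp.py may be named and placed differently, so treat this as a stand-in:

```python
# Hypothetical stand-in for the delay setting inside koboldcpp.py; the real
# variable name and location will differ. Raising it from ~20 to 60 seconds
# would suit slower, instruct-style workflows.
HORDE_LOCAL_USE_DELAY_SECONDS = 60
```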

strikaco commented 2 months ago

I've been using it a bit more and it's already a night-and-day difference as-is, so thanks again!