enova / sidekiq-rate-limiter

Redis-backed, per-worker rate limits for job processing.
http://rubygems.org/gems/sidekiq-rate-limiter
MIT License

High CPU consumption #2

Open khoan opened 10 years ago

khoan commented 10 years ago

When the queue is backed up with nothing but throttled jobs, CPU spikes to 99%. My guess is the fetcher keeps popping throttled jobs off the queue, only to push them back on since the limit has been reached.

We can't simply sleep until the next rate-limit cycle begins, since other non-throttled jobs could be queued and would need to be processed.

Any thoughts on how to prevent the CPU spike described above?
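
For anyone else digging into this, a rough sketch of the loop I suspect is happening; `limit_exceeded?` and `process` are just placeholders, not the gem's actual internals:

```ruby
require 'redis'

redis = Redis.new

# Illustrative sketch only: the fetcher blocks on the queue, immediately
# receives a throttled job, sees the limit is exhausted, and pushes it
# straight back, so the loop spins as fast as Redis can answer.
loop do
  _queue, payload = redis.brpop('queue:throttled', timeout: 2)
  next unless payload

  if limit_exceeded?(payload)               # placeholder for the rate check
    redis.lpush('queue:throttled', payload) # straight back onto the queue
    # No sleep here: the next BRPOP returns instantly because the job we
    # just pushed is already waiting, hence the ~99% CPU.
  else
    process(payload)                        # placeholder for normal execution
  end
end
```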

khoan commented 10 years ago

Interestingly, once the queue is empty, the CPU doesn't spike, so Sidekiq must be implementing some sort of sleep algorithm when the queue is empty.

bwthomas commented 10 years ago

First: sorry for the interminably slow response.

Second: It's a good question. It's definitely entirely on sidekiq-rate-limiter, as it does pop things off the queue, check the limit, & then put them back.

I'll give it some thought. There may be a better way, though I think your observation that sleeping won't work is spot on, since the gem's utility lies in the limiting being configurable per job / worker.

sentience commented 10 years ago

I’ve thought about this a little, and I wonder if adopting an element of how https://github.com/gevans/sidekiq-throttler works could solve this. Rather than rate-limiting by pushing the job back onto the “live” queue (which Sidekiq pulls from as quickly as it can), we could look at pushing rate-limited queues back as scheduled jobs (which Sidekiq only checks every 15 seconds), which would show up in the Delayed count on the Sidekiq web UI.
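
To make that concrete, a minimal sketch of the fetcher branch I have in mind; the method name is made up, but the `schedule` sorted set (scored by run-at epoch time) is where Sidekiq keeps scheduled jobs:

```ruby
# Hypothetical fetcher branch: instead of pushing a throttled job back
# onto the live queue, park it in Sidekiq's scheduled set so that the
# scheduled-job poller (not the fetcher) re-enqueues it once the rate
# interval has passed.
def defer_to_schedule(redis, payload, interval)
  run_at = Time.now.to_f + interval
  redis.zadd('schedule', run_at, payload) # the zset Sidekiq's poller scans
end
```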

Thoughts?

bwthomas commented 10 years ago

So, this gem actually came out of my desire to avoid that solution :) What happens there is that throttled jobs keep getting pushed back into the scheduled set, each time delayed by whatever your period is. So if you use a short period (say, 1 minute), it schedules every job past the limit for 1 minute later. You wind up with the same high CPU usage; the only difference is that your stats wind up in the millions for any sufficiently high amount of traffic.
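
For context, the middleware approach is roughly this shape (a simplified sketch, not sidekiq-throttler's actual code; `over_limit?` and `period_for` are placeholders):

```ruby
# Simplified sketch of a middleware-based throttler: the job has already
# been fetched by the time this runs, and an over-limit job just gets
# rescheduled via perform_in for one period later.
class ThrottleMiddleware
  def call(worker, msg, _queue)
    if over_limit?(worker)                                     # placeholder check
      worker.class.perform_in(period_for(worker), *msg['args'])
      return # the fetch-and-reschedule still happened, so the stats keep climbing
    end
    yield
  end
end
```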

I think the best solution would be to queue them onto a separate queue that's transparent to any clients, & then use some utility algorithm to move them onto the actual queue so that there's always just a little more work queued than you want to process.
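
Something along these lines, say; the queue names and the `limiter_has_capacity?` check are hypothetical, just to show the shape of it:

```ruby
require 'redis'

# Hypothetical "trickler": over-limit jobs sit in a hidden deferred list,
# and a single loop moves them back onto the real queue only when the
# limiter has capacity again, so the fetcher never busy-loops on them.
redis    = Redis.new
DEFERRED = 'queue:critical:deferred'
LIVE     = 'queue:critical'

loop do
  if limiter_has_capacity?                  # placeholder for the rate-limit check
    moved = redis.rpoplpush(DEFERRED, LIVE) # atomic move, one job at a time
    sleep 0.1 unless moved                  # nothing deferred; back off briefly
  else
    sleep 1                                 # over the limit; wait out the window
  end
end
```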

Perhaps it would be advantageous to support multiple throttling strategies which could be configured?

sentience commented 10 years ago

As for the inflated statistics, I believe that because we’d be putting the jobs into the scheduled queue in our custom fetcher, rather than doing it in a Sidekiq middleware when the jobs are run (which is what sidekiq-throttler does), we’d avoid this issue. Sidekiq would not count throttled jobs as done, because they would never “officially” be fetched and run.

With a large number of rate-limited jobs, the CPU usage would still be high, yes, but it would occur only in bursts, each time the jobs came off the scheduled job queue (i.e. once per rate interval). Even with a short interval (e.g. 1 minute), that would mean a burst of CPU usage once a minute, rather than continuous usage.

The only downside I see to this approach is that Sidekiq polls the scheduled jobs queue only once every 15 seconds, so rate intervals on the order of a minute or less would suffer from some significant latency (up to 15 seconds) while Sidekiq gets around to pulling them off the scheduled job queue. For me, 15 seconds give or take is not significant, but it might be in some applications.
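
(For what it’s worth, that poll interval is configurable; the option name has changed across Sidekiq versions, so treat this as indicative rather than exact:)

```ruby
# config/initializers/sidekiq.rb
# Older Sidekiq versions expose this as :poll_interval, newer ones as
# :average_scheduled_poll_interval; check the version you're running.
Sidekiq.configure_server do |config|
  config.options[:poll_interval] = 5 # seconds between scheduled-set checks
end
```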

Am I missing anything else?

heaven commented 8 years ago

@sentience you are missing sidekiq-throttler.

@bwthomas I believe the whole concept this gem uses is wrong. I'd like to have the ability to rate limit queues rather than jobs. Once the limit is reached, we simply need to stop monitoring the queue until the lock expires.

There's a @queues array on which Sidekiq calls a blocking RPOP (BRPOP). So once a job has finished, Sidekiq calls BRPOP to fetch a new job from one of these queues. This operation blocks and holds the connection until a job is available, which lets Sidekiq avoid polling.
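
In code, that fetch is roughly the following (simplified; the real fetcher also shuffles and weights the queues):

```ruby
# Simplified version of the blocking fetch: a single BRPOP across every
# watched queue parks the connection until one of them has a job, which
# is why an empty queue costs no CPU.
QUEUES = ['queue:default', 'queue:throttled']

_queue, payload = redis.brpop(*QUEUES, timeout: 2)
process(payload) if payload # placeholder for handing the job to a processor
```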

I was thinking we could track the number of jobs performed for a given queue per period of time (the limit). When the limit for a queue has been reached, we temporarily remove it from the @queues array and store it separately in Redis, e.g. under queues:exhausted:#{queue_name}, with an expiration time that releases the lock. We also need to start a polling thread which checks for this queue lock to expire and returns the queue to the @queues array. This thread could be started just once, during startup, if there's at least one throttled queue.

Additionally, we'll have to restart the fetcher that is already waiting for new work in BRPOP after removing/re-adding queues from/to the @queues array.
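
A minimal sketch of that bookkeeping, assuming a per-queue counter per window and an expiring lock key; the class, method, and key names are hypothetical, not part of sidekiq-rate-limiter:

```ruby
require 'redis'

# Hypothetical sketch of the proposal above; names and key layout are
# illustrative only.
class QueueThrottle
  def initialize(redis, queues)
    @redis     = redis
    @queues    = queues # e.g. ['queue:mailers', 'queue:default']
    @exhausted = []
  end

  # Called after each job processed from a throttled queue.
  def record_job!(queue, limit:, period:)
    count_key = "throttle:count:#{queue}"
    count = @redis.incr(count_key)
    @redis.expire(count_key, period) if count == 1 # start a new window

    return if count < limit
    @redis.setex("queues:exhausted:#{queue}", period, '1') # the TTL is the lock
    @queues.delete("queue:#{queue}")                       # stop BRPOPing this queue
    @exhausted << queue
  end

  # One watcher thread for all throttled queues: put a queue back once
  # its lock key has expired.
  def start_watcher
    Thread.new do
      loop do
        @exhausted.dup.each do |queue|
          next if @redis.get("queues:exhausted:#{queue}") # lock still held
          @queues << "queue:#{queue}"                     # resume fetching
          @exhausted.delete(queue)
        end
        sleep 1
      end
    end
  end
end
```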

UPD: I was hoping there was a blocking key-expiry operation (something like a BKEY command) that would release the lock when the key expires, but it seems there's no such operation (https://github.com/antirez/redis/issues/2072). It would be really helpful here: we could add our exhausted queue under queues:exhausted:#{queue_name} and run a blocking wait on it, so we wouldn't have to poll. A side effect would be that we'd have to start as many blocking threads as we have throttled queues, due to the blocking nature of such an operation. I am not a Redis expert; the author of that issue said he found a way to achieve the same result using existing features.

bwthomas commented 8 years ago

It may not be ideal, @heaven, but don't forget that my goal is to rate limit because of constraints imposed by the vendor's API. How we get there makes no difference to me.

I still think the best solution is to offer multiple pluggable throttling strategies. Pull requests welcome :)

heaven commented 8 years ago

100% CPU usage is a huge difference, and there's no way to stop Sidekiq from popping jobs off exhausted queues unless you explicitly tell it not to.

bwthomas commented 8 years ago

The same usage was happening with sidekiq-throttler, just through the scheduled queue.

Still, that's not good. So, I reiterate: pull requests welcome.