arturictus / sidekiq_alive

Liveness probe for Sidekiq in Kubernetes deployments
MIT License

sidekiq_alive causes Redis CPU to max out on big deployments #69

Closed Fire-Dragon-DoL closed 1 year ago

Fire-Dragon-DoL commented 1 year ago

Hi! We recently performed an upgrade from v2.1.4 to v2.1.5 that caused a system outage: on boot, Sidekiq maxed out Redis CPU. The machine where Redis is hosted usually doesn't see more than 10% CPU usage.

The investigation into the cause is still in progress. The code that is potentially causing the issue was contributed to sidekiq_alive by our company, so we caused our own problem. Still, it's important to consider that other big deployments (millions of jobs) might face the same issue, and given the kind of outage it caused, the version increase should probably have been at least MINOR, not PATCH.

I will be investigating how this happened and will provide more details as I have them.

arturictus commented 1 year ago

@Fire-Dragon-DoL thanks for letting me know. Any news?

Fire-Dragon-DoL commented 1 year ago

> @Fire-Dragon-DoL thanks for letting me know. Any news?

Hi @arturictus , I don't have "direct" news; the investigation is still ongoing. We are waiting for a replica of the production Redis with scrambled data so we can reproduce the behavior. I gathered some numbers: the software I work on processes at least 10M jobs per day, which suggests the SCAN with COUNT 1000 might be the problem, but this is still a hypothesis.

I'll keep you posted, the ticket is ongoing.

Fire-Dragon-DoL commented 1 year ago

@arturictus I was able to track what was going on!

The problem is the following method: https://github.com/arturictus/sidekiq_alive/blob/5bee57c21f1762f506d73aba7a15387c3e1f51b3/lib/sidekiq_alive.rb#L64-L66

This method is called at: https://github.com/arturictus/sidekiq_alive/blob/5bee57c21f1762f506d73aba7a15387c3e1f51b3/lib/sidekiq_alive.rb#L137-L139

Which happens synchronously, at boot. Now, SCAN itself is not blocking, but called in a tight Ruby loop it behaves similarly to KEYS *, essentially iterating over every key in the database. We use a Sidekiq extension that writes locks as keys in Redis, so there are more than 3M keys in this deployment.
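To illustrate why this adds up (a toy in-memory simulation, not the actual sidekiq_alive code; the key names, COUNT value, and scale are assumptions, with the keyspace shrunk to 30k keys for speed): each SCAN call returns a cursor plus one page of keys, and the MATCH filter is applied only after the page is fetched, so a tight loop ends up paging through the entire keyspace to find a handful of matches.

```ruby
# Toy stand-in for Redis SCAN semantics, for illustration only.
# Each scan call fetches one page; MATCH filtering happens AFTER the fetch.
class ToyRedis
  attr_reader :pages_fetched

  def initialize(keys)
    @keys = keys
    @pages_fetched = 0
  end

  def scan(cursor, match:, count: 1000)
    @pages_fetched += 1
    page = @keys[cursor, count] || []
    next_cursor = cursor + count >= @keys.size ? 0 : cursor + count
    # Filtering is applied to the fetched page, not pushed into the lookup.
    [next_cursor, page.grep(match)]
  end
end

# Many lock keys plus a couple of sidekiq-alive keys, echoing the report.
keys = Array.new(30_000) { |i| "lock:#{i}" } +
       ["sidekiq-alive::a", "sidekiq-alive::b"]
redis = ToyRedis.new(keys)

# The tight loop: keep scanning until the cursor wraps back to 0.
found = []
cursor = 0
loop do
  cursor, page = redis.scan(cursor, match: /\Asidekiq-alive::/)
  found.concat(page)
  break if cursor.zero?
end

puts found.size          # 2 matching keys...
puts redis.pages_fetched # ...but 31 pages fetched to find them
```

With 3M keys instead of 30k, the same loop issues thousands of round trips at boot, which matches the observed CPU spike.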

Writing that message increases boot time from ~30 seconds to over 5 minutes.

Changing the approach could be helpful in this context.

I also noticed this method, which could be problematic: https://github.com/arturictus/sidekiq_alive/blob/5bee57c21f1762f506d73aba7a15387c3e1f51b3/lib/sidekiq_alive/worker.rb#L18-L22

However, it seems to be used only in tests and, as such, does not affect anything.

The fast solution is to report that boot has happened without listing those keys. The "maybe right" solution would be to use a single key instead of a group of keys.

SCAN with a MATCH pattern is still an O(n) operation: the filtering happens after each batch of results is fetched, just before it is returned to the client.
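The "maybe right" single-key solution mentioned above could be sketched like this (a hypothetical design, not sidekiq_alive's actual API; a plain Hash stands in for Redis, and the registry key name is made up): all instance names live under one key, so listing them is a single SMEMBERS-style lookup instead of a keyspace scan.

```ruby
require "set"

# Hypothetical single-key instance registry. The Hash-of-Sets store stands
# in for Redis; in real code these would be SADD / SREM / SMEMBERS calls
# against one key such as "sidekiq-alive::instances".
class InstanceRegistry
  REGISTRY_KEY = "sidekiq-alive::instances" # single key, name assumed

  def initialize(store = Hash.new { |h, k| h[k] = Set.new })
    @store = store
  end

  def register(hostname)
    @store[REGISTRY_KEY] << hostname      # SADD in Redis terms
  end

  def unregister(hostname)
    @store[REGISTRY_KEY].delete(hostname) # SREM
  end

  def registered_instances
    # SMEMBERS: cost proportional to the set size, not the whole keyspace.
    @store[REGISTRY_KEY].to_a.sort
  end
end

registry = InstanceRegistry.new
registry.register("worker-1")
registry.register("worker-2")
puts registry.registered_instances.inspect # ["worker-1", "worker-2"]
```

One trade-off worth noting: members of a Redis set cannot carry individual TTLs the way separate keys can, so a real implementation would need its own expiry or cleanup logic for dead instances.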

Edit:

module SidekiqAlive
  class << self
    # Skip the expensive registered_instances SCAN and print a static
    # message instead, so boot no longer traverses the whole keyspace.
    def successful_startup_text
      "Successfully started sidekiq-alive, registered instances: <hidden>"
    end
  end
end

This monkey patch cuts boot time by about 5 minutes, bringing it back down to a few seconds.

Fire-Dragon-DoL commented 1 year ago

Testing this locally right now; will close if it works.

Fire-Dragon-DoL commented 1 year ago

Perfect :+1: