arturictus / sidekiq_alive

Liveness probe for Sidekiq in Kubernetes deployments
MIT License
188 stars 57 forks

Worker goes unhealthy after 10 minutes on sidekiq 7 #94

Open w32-blaster opened 1 year ago

w32-blaster commented 1 year ago

We are trying to upgrade our infrastructure to Sidekiq 7 and are facing the following issue: sidekiq_alive works fine when registering the worker in Redis, and the pod becomes healthy. After exactly 10 minutes it becomes unhealthy with the message "Can't find the alive key" and the pod gets restarted. We have verified that during this 10-minute period the worker really is alive, so it's not that our health check only starts working after 10 minutes. Any idea where to look to solve this problem?

arturictus commented 1 year ago

Hi @w32-blaster , Is the sidekiq-alive queue running? Can you check if it shows in the interface?

w32-blaster commented 1 year ago

@arturictus do you mean to check it in the sidekiq UI?

arturictus commented 1 year ago

@w32-blaster, yes

rakibulislam commented 11 months ago

Hi @arturictus Yes, the sidekiq-alive queue is running and we can see it in the Sidekiq UI. Please let us know if you have any clue why this might be happening with the sidekiq-alive gem when used with Sidekiq 7.1. One more quick piece of info: we also use sidekiq-cron. Since this issue happens every 10 minutes, could the sidekiq-cron gem play a role here somehow? Please let us know if you can think of anything to solve this issue. Thanks! cc: @w32-blaster

andrcuns commented 11 months ago

I am also using it with sidekiq 7.1 together with sidekiq-cron and it seems to work fine.

One problem that can happen is if your cron jobs spawn enough jobs to take all available threads: the alive job will not get picked up and the app will start reporting as unhealthy. I encountered such an issue a while back in my project here: https://gitlab.com/dependabot-gitlab/dependabot/-/issues/293#note_1408148243

This is not really solvable in Sidekiq < 7, but capsules in the new version should be able to address it. Potential fix: https://github.com/arturictus/sidekiq_alive/pull/96

w32-blaster commented 11 months ago

That's not the case. Our worker threads are sitting there doing nothing. Then after 10 minutes they get removed because of the health check.

andrcuns commented 11 months ago

You should try monitoring the Sidekiq scheduled jobs page and see if it consistently has the alive job scheduled. (screenshot: Sidekiq Scheduled tab)

Given that the default TTL value is 10 minutes, this basically means that sidekiq-alive is not working at all. You can change the option config.time_to_live = 10 * 60 in the config to validate that.
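For reference, a minimal initializer sketch of that option, assuming the gem's setup block; lowering the TTL makes the failure reproduce in 1 minute instead of 10:

```ruby
# config/initializers/sidekiq_alive.rb
# Sketch (config fragment): shorten sidekiq_alive's key TTL so the
# "Can't find the alive key" failure, if present, shows up quickly.
SidekiqAlive.setup do |config|
  config.time_to_live = 60 # seconds; the gem's default is 10 * 60
end
```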

rakibulislam commented 11 months ago

> You should try and monitor the sidekiq scheduled jobs page and see if it consistently has alive job scheduled
>
> Given the default ttl value is 10 minutes, it basically means that sidekiq-alive is not working at all. You can change the option config.time_to_live = 10 * 60 in the config to validate that.

@andrcuns You are right. I just checked and verified this: the sidekiq-alive job is not scheduled in staging (where we deployed Sidekiq 7), but it's present in other environments where we still have Sidekiq 6.5.6.

Now, could you please help us debug where the actual issue is? Here is the list of gems and versions we are using along with sidekiq 7.1.2: sidekiq-cron 1.10.1, sidekiq-failures 1.0.4, sidekiq-limit_fetch 4.4.1, sidekiq-unique-jobs 8.0.3, sidekiq_alive 2.2.3. We are also using redis 5.0.7 and hiredis-client 0.16.0. Are you using hiredis-client as well?

@andrcuns Any clue on how to debug this issue would be highly appreciated.

andrcuns commented 11 months ago

I am not using hiredis, so I can't really comment on that, but it's possible it is somehow at fault here and is causing issues for native Redis commands being executed.

You can start by checking the Sidekiq startup logs to see whether sidekiq-alive even starts correctly. You should see something like this:

[2023-09-01 08:08:17 +0000] INFO  -- Starting sidekiq-alive: {:hostname=>"dependabot-gitlab-worker-fc46db7c4-kpdwp", :port=>"7433", :ttl=>60, :queue=>"healthcheck-dependabot-gitlab-worker-fc46db7c4-kpdwp", :register_set=>"sidekiq-alive-hostnames", :liveness_key=>"SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::dependabot-gitlab-worker-fc46db7c4-kpdwp", :register_key=>"SIDEKIQ_REGISTERED_INSTANCE::dependabot-gitlab-worker-fc46db7c4-kpdwp"}
[2023-09-01 08:08:17 +0000] INFO  -- Successfully started sidekiq-alive, registered with key: SIDEKIQ_REGISTERED_INSTANCE::dependabot-gitlab-worker-fc46db7c4-kpdwp on set sidekiq-alive-hostnames
[2023-09-01 08:08:18 +0000] INFO  -- WEBrick 1.8.1
[2023-09-01 08:08:18 +0000] INFO  -- ruby 3.1.4 (2023-03-30) [x86_64-linux]
[2023-09-01 08:08:18 +0000] INFO  -- WEBrick::HTTPServer#start: pid=7 port=7433

rakibulislam commented 11 months ago

I can see the sidekiq-alive is starting successfully:

{"ts":"2023-09-04T14:02:25.641Z","pid":1,"tid":"3qzh","lvl":"INFO","msg":"Starting sidekiq-alive: {:hostname=>\"website-sidekiq-564f44bf99-k29cc\", :port=>7433, :ttl=>600, :queue=>\"sidekiq-alive-website-sidekiq-564f44bf99-k29cc\", :register_set=>\"sidekiq-alive-hostnames\", :liveness_key=>\"SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::website-sidekiq-564f44bf99-k29cc\", :register_key=>\"SIDEKIQ_REGISTERED_INSTANCE::website-sidekiq-564f44bf99-k29cc\"}"}
{"ts":"2023-09-04T14:02:25.648Z","pid":1,"tid":"3qzh","lvl":"INFO","msg":"Successfully started sidekiq-alive, registered with key: SIDEKIQ_REGISTERED_INSTANCE::website-sidekiq-564f44bf99-k29cc on set sidekiq-alive-hostnames"}

I will try removing hiredis, deploy, and see if that fixes the issue. I will reply back once I've tested it out. Thanks so much for your help thus far @andrcuns :-)

andrcuns commented 11 months ago

If it starts successfully, then the issue is either that the job which is supposed to update the alive key is not being scheduled, or that the job is scheduled but fails to update the key. I believe it's one of those two, given that it starts failing only after exactly 10 minutes and the log shows the liveness key being created successfully.

rakibulislam commented 11 months ago

@andrcuns So, I tried removing the hiredis client but the issue persists. I can see the sidekiq-alive queue here: https://app.com.com/sidekiq/queues but it does not appear in the Scheduled tab: https://app.com/sidekiq/scheduled, which means the job is not being scheduled. Do you have any idea where I can dig deeper to see why the job is not being scheduled? Thank you once again!

andrcuns commented 11 months ago

@rakibulislam Not really sure. I imagine you could try enabling debug logging for Sidekiq and see if that reveals anything.

We could probably use some additional debug log messages in this repo as well, such as the actions performed during startup and inside the worker.

If it's not getting scheduled, I'm guessing that either this line did not schedule the initial job, or it did but this action did not trigger the next job.
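A sketch of turning on that debug logging in an initializer (config fragment; Sidekiq also honors the -v CLI flag for the same effect):

```ruby
# config/initializers/sidekiq.rb
# Sketch: raise the Sidekiq server's log level to DEBUG so scheduling
# activity (including sidekiq-alive's self-enqueued jobs) shows up in logs.
require "logger"

Sidekiq.configure_server do |config|
  config.logger.level = Logger::DEBUG
end
```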

rakibulislam commented 11 months ago

@andrcuns I was debugging this locally and found that this line is executed on Sidekiq startup, but this line is not executed automatically, and I think that's the main issue we need to figure out. However, when I run this command directly from the Rails console: SidekiqAlive::Worker.perform_async(hostname), it runs but fails with ConnectionPool::TimeoutError:

{"ts":"2023-09-13T04:33:33.682Z","pid":45635,"tid":"3f5v","lvl":"INFO","msg":"start","ctx":{"class":"SidekiqAlive::Worker","jid":"6e47fd68626fdb61325b0445"}}
SidekiqAlive::Worker.perform begins: 1
SidekiqAlive::Worker.perform begins: 2
****************************************************************************************************
store_alive_key!!! current_lifeness_key: SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::HOSTNAME_NOT_SET
****************************************************************************************************
{"ts":"2023-09-13T04:33:34.791Z","pid":45635,"tid":"3f5v","lvl":"INFO","msg":"fail","ctx":{"class":"SidekiqAlive::Worker","jid":"6e47fd68626fdb61325b0445","elapsed":1.109}}
{"ts":"2023-09-13T04:33:34.792Z","pid":45635,"tid":"3f5v","lvl":"WARN","msg":"{\"context\":\"Job raised exception\",\"job\":{\"retry\":false,\"queue\":\"default\",\"args\":[\"HOSTNAME_NOT_SET\"],\"class\":\"SidekiqAlive::Worker\",\"jid\":\"6e47fd68626fdb61325b0445\",\"created_at\":1694579613.643147,\"enqueued_at\":1694579613.646084,\"error_message\":\"Waited 1 sec, 0/2 available\",\"error_class\":\"ConnectionPool::TimeoutError\",\"processor\":\"Ks-MacBook-Pro.local:45635\",\"failed_at\":1694579614.7854972},\"_config\":{\"labels\":\"#<Set: {}>\",\"require\":\".\",\"environment\":\"development\",\"concurrency\":5,\"timeout\":25,\"poll_interval_average\":null,\"average_scheduled_poll_interval\":5,\"on_complex_arguments\":\"raise\",\"error_handlers\":[\"#<Proc:0x0000000105925350 /Users/k.m.rakibulislam/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/sidekiq-7.1.2/lib/sidekiq/config.rb:37 (lambda)>\",\"#<Raven::SidekiqErrorHandler:0x0000000108595d18>\"],\"death_handlers\":[\"#<Proc:0x00000001096d8170 /Users/k.m.rakibulislam/Development/Yoshi/website/config/initializers/sidekiq.rb:9 (lambda)>\",\"#<Proc:0x000000010a6c92c8 /Users/k.m.rakibulislam/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/sidekiq-unique-jobs-8.0.3/lib/sidekiq_unique_jobs/server.rb:8 (lambda)>\"],\"lifecycle_events\":{\"startup\":[],\"quiet\":[\"#<Proc:0x000000010a34a070 /Users/k.m.rakibulislam/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/sidekiq_alive-2.3.0/lib/sidekiq_alive.rb:45>\"],\"shutdown\":[\"#<Proc:0x000000010a34a048 /Users/k.m.rakibulislam/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/sidekiq_alive-2.3.0/lib/sidekiq_alive.rb:50>\",\"#<Proc:0x00000001096d7b08 /Users/k.m.rakibulislam/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/sidekiq-unique-jobs-8.0.3/lib/sidekiq_unique_jobs/server.rb:21>\"],\"heartbeat\":[],\"beat\":[\"#<Proc:0x000000010546b328 
/Users/k.m.rakibulislam/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/sidekiq-7.1.2/lib/sidekiq/metrics/tracking.rb:133>\"]},\"dead_max_jobs\":10000,\"dead_timeout_in_seconds\":15552000,\"reloader\":\"#<Sidekiq::Rails::Reloader @app=Website::Application>\",\"backtrace_cleaner\":\"#<Proc:0x000000010a5693d8 /Users/k.m.rakibulislam/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/sidekiq-7.1.2/lib/sidekiq/rails.rb:59 (lambda)>\",\"pidfile\":\"tmp/pids/sidekiq.pid\",\"beta\":{\"concurrency\":5},\"staging\":{\"concurrency\":5},\"production\":{\"concurrency\":20},\"queues\":[\"charging\",\"high_pri\",\"order_positioning\",\"delivery_positioning\",\"membership_renew\",\"messaging\",\"mailers\",\"default\",\"low_pri\",\"set_first_effective_eta\",\"elasticsearch\"],\"limits\":{\"delivery_positioning\":1},\"logfile\":\"log/sidekiq.log\",\"config_file\":\"./config/sidekiq.yml\",\"tag\":\"website\",\"identity\":\"Ks-MacBook-Pro.local:45635:cd297817bf8b\",\"cron_poll_interval\":30,\"fetch_class\":\"Sidekiq::LimitFetch\"}}"}
{"ts":"2023-09-13T04:33:34.793Z","pid":45635,"tid":"3f5v","lvl":"WARN","msg":"ConnectionPool::TimeoutError: Waited 1 sec, 0/2 available"}

Do you see why this might be happening? Thanks for your time and help!

rakibulislam commented 11 months ago

@andrcuns Looks like it's enqueuing the SidekiqAlive::Worker job but it's not being processed, so nothing is scheduled. But if I run the same command, SidekiqAlive::Worker.perform_async(hostname), from the Rails console, it does enqueue and then get processed, and hence shows up in the Scheduled tab in the Sidekiq UI. Any thoughts on what could be happening here?

Javierchik commented 11 months ago

Hey Team! The Can't find the alive key message is also showing up for me in an Azure web app (service) that runs Sidekiq 6.5.9. I was able to reproduce the exact same behavior, unhealthy after 10 minutes, locally. It looks to me like Sidekiq somehow does not get a reference to the new queue that is created after 10 minutes (the default TTL).

(screenshot attached)

Appreciate any help!

rakibulislam commented 11 months ago

Okay, I finally fixed this issue! For me everything was working fine with Sidekiq 6.5.6, but sidekiq-alive stopped processing and scheduling jobs after we upgraded to Sidekiq 7.1. The solution was to add the sidekiq-alive queue to the sidekiq.yml file, defining the queue name this way: <%= "sidekiq-alive-#{ENV["HOSTNAME"] || 'HOSTNAME_NOT_SET'}" %>. After this, sidekiq-alive started working as expected with Sidekiq 7.1!
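A sketch of what that sidekiq.yml entry might look like (config fragment; the "default" queue and concurrency value are placeholders, only the ERB line comes from the comment above):

```yaml
# config/sidekiq.yml (sketch; replace "default" with your real queues)
:concurrency: 5
:queues:
  - default
  - <%= "sidekiq-alive-#{ENV['HOSTNAME'] || 'HOSTNAME_NOT_SET'}" %>
```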

andrcuns commented 11 months ago

That's not really a fix, though; the queue should be created automatically. In the latest version with Sidekiq 7+ it uses a separate capsule as well.

It would be nice to understand why the queue was not there.

rakibulislam commented 11 months ago

@andrcuns I agree with you. It would be great to understand why it is not creating the sidekiq-alive queue automatically with Sidekiq 7.1. Perhaps creating a new capsule might help? I will explore more in the future, but the fix works for us for now.

w32-blaster commented 11 months ago

This really sounds like a workaround to me, but @rakibulislam did great work figuring this out. I wonder if a good next step would be to add debug-level logs to sidekiq-alive, so the actual root cause can be found or at least narrowed down.

Javierchik commented 11 months ago

The workaround from @rakibulislam also worked for me, and the Sidekiq Azure web container now stays alive past the 10-minute threshold. The queue now shows in the Scheduled tab and refreshes every 5 minutes.

Thank you.

andrcuns commented 11 months ago

Yeah, some additional debug-level messages wouldn't hurt. I'll see if I can add a PR sometime this week.

arturictus commented 10 months ago

Hi there, trying to figure out what the cause could be.

  1. Queues not being added automatically: in @Javierchik's picture it looks like the queues are set: https://github.com/arturictus/sidekiq_alive/issues/94#issuecomment-1721259607 but there are still doubts about that. QUESTION: could you please confirm whether the queues are listed in the web UI or in the instance? HELP: could you enqueue a job in that queue and test whether the queue executes jobs? Even better if it is not the SidekiqAlive::Worker.

  2. Hostname not set: in case you are running multiple instances, they will share the same Redis key, which causes the opposite effect: returning alive even with blocked or inactive instances. This library is intended to run with HOSTNAME present; without it, it will produce false positives.

  3. Worker not being scheduled: looks improbable, since these logs show success and the enqueuing is done before the success log: https://github.com/arturictus/sidekiq_alive/issues/94#issuecomment-1705354958

  4. Error in the job's connection pool:

    • [ ] We should retry the job in case this happens -> 🗒️

  5. Suspicion: maybe Sidekiq 7 loads/copies the queues into memory before we run on(:startup), effectively never running our queue even though it is visible on the web UI. The best solution I can think of right now is the sidekiq.yml option that @rakibulislam proposed here, but that is a workaround more than a fix.

  6. TODO:

    • [ ] Review the capsule functionality, just in case we missed something in the implementation.

QUESTION: are you running Redis in a cluster or as a single instance? Replicas?
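On the hostname point: in Kubernetes the HOSTNAME variable is normally the pod name, but if it is missing in your environment, here is a sketch of setting it explicitly via the Downward API (config fragment; the field names are standard Kubernetes, the container name is a placeholder):

```yaml
# Pod spec sketch: guarantee HOSTNAME is set for sidekiq-alive.
# "sidekiq" is a placeholder container name.
containers:
  - name: sidekiq
    env:
      - name: HOSTNAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
```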

w32-blaster commented 10 months ago

@arturictus we are running Redis in ElastiCache as a single instance, not in cluster mode. There are no replicas currently.

breiko83 commented 10 months ago

Facing the same issue here. The 10-minute mark is actually based on the default TTL for the Redis key in the config file:

  # ==> Time to live
  # Time for the key to be kept by Redis.
  # Here is where you can set the periodicity with which Sidekiq has to probe that it is working
  # Time unit: seconds
  # default: 10 * 60 # 10 minutes

I reduced this to 60 and the worker goes unhealthy after 1 minute ("Can't find the alive key"), even though the worker is fine.

Update: I had a look at Redis, and the key SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::MY-HOST is set correctly once Sidekiq starts, but it is never updated. So after 10 minutes (or whatever TTL is set), the key gets removed (expired) and WEBrick returns "Can't find the alive key".
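To make that failure mode concrete, here is a minimal self-contained sketch (an in-memory stand-in for Redis, not the gem's code) of why a key written once with a 600-second expiry disappears and the probe then fails:

```ruby
# In-memory stand-in for Redis SET with EX (expiry), driven by a fake clock.
class FakeRedis
  def initialize
    @store = {}
  end

  # Mimics SET key value EX ttl; "now" is the fake clock time of the write.
  def set(key, value, ex:, now:)
    @store[key] = { value: value, expires_at: now + ex }
  end

  # Returns nil once the key has expired, like Redis GET.
  def get(key, now:)
    entry = @store[key]
    entry && entry[:expires_at] > now ? entry[:value] : nil
  end
end

redis = FakeRedis.new
ttl = 600 # sidekiq_alive's default time_to_live

# The key is written once at startup and never refreshed:
redis.set("SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::my-host", "ts", ex: ttl, now: 0)

redis.get("SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::my-host", now: 599) # => "ts"
redis.get("SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::my-host", now: 601) # => nil, so: "Can't find the alive key"
```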

amkisko commented 10 months ago

Confirming the issue: exactly after 10 minutes the worker gets unhealthy status, so basically sidekiq_alive returns "not alive". After that, Sidekiq starts failing with ConnectionPool::TimeoutError: Waited 1 sec, 0/2 available, which seems related to the missing workers.

ENV["HOSTNAME"] is present and available to both sidekiq and sidekiq_alive.

> You should try and monitor the sidekiq scheduled jobs page and see if it consistently has alive job scheduled.

The sidekiq_alive job was not present in scheduled jobs. During another run the job is present. 🤔 I don't see any pattern yet.


What I just saw: the job was still scheduled, but the Sidekiq worker had already been terminated by the provisioner. One option is to increase failureThreshold in the Kubernetes pod setup, if you use kube, so the sidekiq_alive job has a bit more time.

Another option is to get rid of the livenessProbe completely, temporarily.
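A sketch of that probe tuning (config fragment; httpGet, periodSeconds, and failureThreshold are standard Kubernetes livenessProbe fields, and the timing values are illustrative):

```yaml
# Pod spec sketch: give sidekiq_alive more slack before the pod is killed.
livenessProbe:
  httpGet:
    path: /
    port: 7433            # sidekiq_alive's default port
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 10    # raised from the Kubernetes default of 3
```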

UPD: So, tuning the Kubernetes pod settings helped. The sidekiq_alive queue is present, the job is enqueued, and the status updates without issues.


UPD 2: 2023-10-06T07:35:07.588Z pid=16 tid=1y2w ERROR: Can't find the alive key - that's the source of issue.


UPD 3: I checked manually in Rails console this:

SidekiqAlive.hostname
SidekiqAlive.alive?
SidekiqAlive::Worker.perform_sync

After running it manually, the pod became alive, which basically means that SidekiqAlive::Worker had never run before I called it manually. What makes the job not run automatically at start?

Now I also see two jobs enqueued for SidekiqAlive::Worker. After waiting 10 minutes, there are still two jobs in the queue and WEBrick returns "Can't find the alive key". Something happens during processing of the job; waiting longer does not affect the situation at all, while manually adding a new job definitely resolves it.

Also spotted that the hostname parameter is not used in the worker's perform method, so be aware of this if you are trying to call the worker for another instance: it will use its own instance's hostname.


UPD 5: The worker job gets properly scheduled when manually called, so it goes under "Scheduled jobs" and the waiting time is 5 minutes (600/2, as per the code). At some point the job gets stuck under "Enqueued jobs" and never gets back to "Scheduled jobs".

2023-10-07T11:23:41.208Z pid=1 tid=1ydx class=SidekiqAlive::Worker jid=c517b1a855d55e53bac31816 INFO: start
2023-10-07T11:23:42.213Z pid=1 tid=1ydx class=SidekiqAlive::Worker jid=c517b1a855d55e53bac31816 elapsed=1.005 INFO: fail
2023-10-07T11:23:42.214Z pid=1 tid=1ydx WARN: {"context":"Job raised exception","job":{"retry":false,"queue":"default","class":"SidekiqAlive::Worker","args":["app-staging-sidekiq-abcdef"],"jid":"c517b1a855d55e53bac31816","created_at":1696677518.307089,"enqueued_at":1696677821.2067275}}
2023-10-07T11:23:42.214Z pid=1 tid=1ydx WARN: ConnectionPool::TimeoutError: Waited 1 sec, 0/2 available

UPD 6: Most probably I will just go with a simple implementation of the web server, like this:

# PATH: config/initializers/sidekiq_alive.rb

class SidekiqAliveServer
  def run!
    handler = Rack::Handler.get("webrick")
    Signal.trap("TERM") { handler.shutdown }
    handler.run(self, Port: 7433, Host: "0.0.0.0", AccessLog: [], Logger: Rails.logger)
  end

  # Rack entrypoint: report liveness from Sidekiq's own process/queue stats.
  def call(_env)
    workers = Sidekiq::Workers.new
    process_set = Sidekiq::ProcessSet.new
    process_set_size = process_set.size
    queues = Sidekiq::Queue.all
    queues_size = queues.size
    queues_avg_latency = queues.sum(&:latency) / queues_size if queues_size.positive?
    response = {
      workers_size: workers.size,
      process_set_size: process_set_size,
      queues_size: queues_size,
      queues_avg_latency: queues_avg_latency
    }
    # Alive when at least one Sidekiq process exists, queues are visible,
    # and the average queue latency is under 30 minutes.
    is_alive = process_set_size.positive? && queues_size.positive? && queues_avg_latency < 30 * 60
    response_code = is_alive ? 200 : 500
    [response_code, { "Content-Type" => "application/json" }, [response.to_json]]
  end
end

Sidekiq.configure_server do |config|
  config.on(:startup) do
    @server_pid = fork { SidekiqAliveServer.new.run! }
  end

  config.on(:quiet) do
  end

  config.on(:shutdown) do
    Process.kill("TERM", @server_pid) unless @server_pid.nil?
    Process.wait(@server_pid) unless @server_pid.nil?
  end
end

Latest version is here: https://gist.github.com/amkisko/95662a67da9c7344ee538786ed3e9d6e

jasondoc3 commented 10 months ago

I'm not sure if this will help anyone else, but I was also experiencing this problem with sidekiq 6.5.7. After some deep debugging, I discovered that the issue for me was actually the sidekiq-throttled gem.

On the throttled README, it states:

> Load order can be an issue if you are using other Sidekiq plugins and/or middleware. To prevent any problems, add the .setup! call to the bottom of your init file.

Because sidekiq_alive uses an on(:startup) lifecycle event to modify the configuration, I modified my Sidekiq initializer to set up the throttled gem in a separate on(:startup) lifecycle event, correcting the load order:

# config/initializers/sidekiq.rb
require "sidekiq/throttled"

Sidekiq.configure_server do |config|
  config.on(:startup) { Sidekiq::Throttled.setup! }
end

mleiseca commented 6 months ago

@amkisko I was having the same issue as your comment here

  1. SidekiqAlive.start executes and calls SidekiqAlive::Worker.perform_async
  2. When SidekiqAlive::Worker.perform executes, it calls store_alive_key. This fails with
    ConnectionPool::TimeoutError: Waited 1 sec, 0/2 available

Fix:

In Sidekiq.configure_server add a section that creates a capsule for SidekiqAlive

Sidekiq.configure_server do |config|
  config.capsule(SidekiqAlive::CAPSULE_NAME) do |cap|
    cap.concurrency = 3
    hostname = ENV["HOSTNAME"] || "HOSTNAME_NOT_SET"
    cap.queues = ["sidekiq-alive-#{hostname}"]
  end
end

I don't really know much about Sidekiq internals, but it seems like it's creating a Redis connection pool that's too small: somehow there are only two connections, and the pool is exhausted when this method is called:

    def store_alive_key
      redis.set(current_lifeness_key, time: Time.now.to_i, ex: config.time_to_live.to_i)
    end

Setting cap.concurrency = 3 makes a slightly larger connection pool just for sidekiq-alive.

stefanoc commented 1 month ago

@jasondoc3 we were having the exact same problem with Sidekiq 6.5.11 and luckily I found your comment. Thanks! :-)