arturictus / sidekiq_alive

Liveness probe for Sidekiq in Kubernetes deployments
MIT License

Queue Priority / Ordering #31

Closed zacheryph closed 3 years ago

zacheryph commented 4 years ago

We commonly have jobs (mailers) that get pushed into the queue, thousands at a time. These sometimes take a while to process. This can cause the alive scheduled job to not get processed: it gets queued, but sidekiq_alive-XXXX sits below mailers in the queue order. In turn the probe fails because the job never runs.

Is/can a higher priority be set on the alive queue to ensure the worker does not get restarted when it's actually getting work done? :)

benlangfeld commented 4 years ago

This is also a problem if the worker is processing a number of long jobs equal to its maximum thread count; there will be no capacity left to process the health check jobs.

Indeed, if we combine the two issues, such that there is a backlog of long-running jobs when the worker process starts, it appears to start working those and fill up its capacity before the health check job is even queued, meaning that even setting a long TTL on the health check won't work because it will never be set healthy to then later expire.

limcross commented 3 years ago

I have the same problem, and I don't know if it's designed like that on purpose. I'll do a PR to propose this change, but in the meantime, you can solve it like this.

Sidekiq.configure_server do |config|
  # If the SidekiqAlive queue ended up last in the queue list, move it to the
  # front so the alive job is picked up before the backlog in other queues.
  if config.options[:queues].last == SidekiqAlive.current_queue
    config.options[:queues].prepend(config.options[:queues].pop)
  end
end
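
Note that this relies on config.options, which later Sidekiq releases have deprecated and removed. A rough equivalent for Sidekiq 7.x might look like the sketch below, but please verify it against the version you actually run, since the configuration API has changed more than once:

Sidekiq.configure_server do |config|
  # On Sidekiq 7.x the queue list is exposed via config[:queues] rather than
  # config.options[:queues]; mutate it in place to move the alive queue first.
  queues = config[:queues]
  queues.prepend(queues.pop) if queues.last == SidekiqAlive.current_queue
end
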
indiebrain commented 3 years ago

@limcross Does this assume that the worker should NOT process a queue where the jobs therein may take longer than the alive key TTL? It occurs to me that starvation of the sidekiq_alive queue may still be possible if the worker fills up with jobs which take longer to process than the key TTL.

An example: one application I spend quite a bit of time with is a deployment system which uses Sidekiq to process deployment jobs. Some deployments of the rather large applications can take ~30 minutes to complete - which is its own problem, but still the reality of the situation. Assuming our Sidekiq worker can process n jobs at a time, I think all that would be necessary in our case is for n deployment jobs to be scheduled at roughly the same time, queueing the "Alive" job behind them - at depth n+1 in the job queue. I believe this is sufficient to cause the alive ping to fail even if we were to make sure the alive queue were the "higher-priority" queue. The deployment jobs are already in progress - the worker is already starved - by the time the alive job arrives on the high-priority queue.

We don't want the alive key TTL to be set to something ridiculously high like 30 minutes - since that would mean we'd fail to detect a downed worker for quite a long time. Additionally, in our case deployments can consume a considerable amount of resources in our worker pods, so we limit them to quite a low number of concurrent jobs - ~4 or 5, I believe.

Hope that adds some color for consideration. Does this seem like a rational parsing of how the suggested code snippet would behave in this scenario?

limcross commented 3 years ago

I'm sorry @indiebrain, but I have no solution for that issue other than increasing the key lifetime. In my case, we have rewritten some workers to split the work and reduce the run time, and have extended the key lifetime for those that could not be split.
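
For reference, the key lifetime can be raised from the gem's initializer. A minimal sketch, assuming the time_to_live option documented in the sidekiq_alive README (double-check the option name against your gem version):

# config/initializers/sidekiq_alive.rb
SidekiqAlive.setup do |config|
  # Allow up to 30 minutes between successful alive jobs before the
  # liveness key expires and the probe starts failing.
  config.time_to_live = 30 * 60
end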

arturictus commented 3 years ago

Hi @indiebrain, an interesting problem to solve :). The only solution I can think of is to open a child process from the process executing your build and keep refreshing the sidekiq alive key in Redis while it is working. This would be an Elixir-style approach, and I think it could be done with concurrent-ruby.

The code in your workers could look something like this:

def perform(*args)
  SidekiqAlive.long_process_keep_alive() # naming can be improved
  # this will:
  # open a new child process with a loop that stores the liveness key in redis
  # this child process should be killed when/if this one stops
  do_my_deployment()
end
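
For illustration, a rough sketch of that idea using concurrent-ruby's TimerTask (a background thread rather than a separate OS process). It assumes SidekiqAlive.store_alive_key is the method that refreshes the liveness key in Redis; long_process_keep_alive above is still hypothetical:

require "concurrent"

def perform(*args)
  # Refresh the liveness key every 30 seconds while the long job is running.
  keep_alive = Concurrent::TimerTask.new(execution_interval: 30) do
    SidekiqAlive.store_alive_key
  end
  keep_alive.execute

  do_my_deployment()
ensure
  # Stop the background refresher even if the deployment raises.
  keep_alive&.shutdown
end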

I'll try to investigate this approach.

zedtux commented 1 year ago

I'm joining this thread since I'm facing the same issue as @indiebrain.

Have you guys found anything that could fix / help here?

Side note: someone has created an issue on the Sidekiq project to request a Health/Liveness check.