arturictus / sidekiq_alive

Liveness probe for Sidekiq in Kubernetes deployments
MIT License
188 stars 57 forks source link

child process terminates before sidekiq jobs are finished #85

Closed matt-domsch-sp closed 6 months ago

matt-domsch-sp commented 1 year ago

Sidekiq can have jobs that are very long-running. In the standard sidekiq manager stop(deadline) method, the sidekiq shutdown handlers run (including sidekiq_alive shutdown handler to kill the child process), and then it waits for up to (deadline) seconds for all the workers to finish their jobs, which could be an arbitrarily large wait. In the meantime, anything that's calling sidekiq_alive's health check endpoint will fail as that process is now dead, indicating to the caller that the sidekiq process itself is now dead (but it's not - there's still work being done - jobs in flight) - which may then hard-terminate the container sidekiq is in.

I think a proper solution is to move the quiet and shutdown handling code into an at_exit handler instead, so the child process is terminated only immediately as sidekiq is on its way to exiting.

One problem with this approach though is that during this time (after quiet and shutdown handlers have run, but before sidekiq has exited, is that no more worker jobs are running to update the redis key. Therefore we can't return the HTTP 404 response during this time, lest that be an indication sidekiq has exited.

Thoughts?

arturictus commented 1 year ago

Idea:

Taking into consideration the current quiet/shutdown callbacks https://github.com/arturictus/sidekiq_alive/blob/66f2f3f435795da36745eb8ac95651e036b280e2/lib/sidekiq_alive.rb#L34-L45

We could store in memory, config or class variable, a @quiet boolean. If the current state is @quiet == true the server could return alive: true.

When the shutdown is triggered, the server process will terminate, not responding any more. @matt-domsch-sp, what do you think?

andrcuns commented 11 months ago

@arturictus

I just noticed, we call the shutdown callback essentially twice, is this intended? I just noticed that in the logs of my app both the callback and queue purging happens twice.