Closed suwyn closed 9 months ago
Hi @suwyn,
Thanks for reaching out!
Adding a `break if stop?` check more closely mirrors how Delayed responds to a `SIGINT` or `SIGTERM`, by attempting to finish the current set of jobs assigned to the thread pool before exiting. (I'm simulating that with 3 "jobs," repeated forever by the "worker" in a loop.)
`SIGINT` (Ctrl+C) sent during the second set of jobs: it loops around to pick up more jobs until it receives `SIGINT`/`SIGTERM`, at which point it will finish the current pool and exit cleanly instead of picking up more jobs. However, if it receives a `SIGKILL` instead, it will exit immediately, without attempting to clean anything up. (This signal is special and is handled by the kernel/OS -- the worker never actually receives it, so there would be no way for it to react.)
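The trap-and-check pattern described above can be sketched roughly like this (class and method names here are hypothetical illustrations, not Delayed's actual internals):

```ruby
# Sketch of a worker that finishes its current batch of jobs before
# exiting on SIGTERM/SIGINT. Signal handlers only flip a flag; the
# main loop checks the flag between batches.
class GracefulWorker
  def initialize
    @stop = false
  end

  def stop?
    @stop
  end

  def start
    # Handlers must do minimal work; setting a boolean is safe in a trap.
    trap('TERM') { @stop = true }
    trap('INT')  { @stop = true }

    loop do
      run_current_jobs   # finish the current set of "jobs"
      break if stop?     # then check the flag before picking up more
    end
  ensure
    cleanup!             # runs on a clean break, but never on SIGKILL
  end

  private

  def run_current_jobs
    3.times do |i|
      puts "job #{i}"
      sleep 0.05 # simulate a short unit of work
    end
  end

  def cleanup!
    puts 'cleanup!'
  end
end
```

On `SIGKILL` the process dies before the `ensure` block can run, which is why no cleanup happens in that case.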
> What's the recommended way to gracefully shut down a running job when the process is terminated?
Right, so to answer your question: the way to gracefully shut down is to first send a `SIGTERM`, give jobs a chance to finish, and then send a `SIGKILL` later if/when you need to fully stop the process. If you're not seeing `SIGTERM` produce a clean exit within a reasonable amount of time, it's likely due to long-running jobs. This means that jobs need to be short-lived enough to complete gracefully within that waiting period; otherwise the worker will exit ungracefully, and new workers will wait until `locked_at + Delayed::Worker.max_run_time` in order to be sure that no other worker is running the job.
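The pickup rule above can be illustrated with a rough sketch of the timing condition (this is not Delayed's actual query, just the logic it implies):

```ruby
# A job that is locked only becomes claimable by another worker once
# locked_at + max_run_time has elapsed; an unlocked job is always claimable.
# max_run_time is in seconds (Delayed's default is 20 minutes = 1200s).
def claimable?(locked_at, max_run_time, now: Time.now)
  locked_at.nil? || now >= locked_at + max_run_time
end
```

So a job orphaned by an ungraceful exit sits untouched for up to the full `max_run_time` before any worker will pick it up again.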
In general, I'd suggest deconstructing jobs into shorter units of work. But keep in mind that once you send the `SIGTERM`, you know that the worker won't pick up any new jobs, so, depending on your deployment infrastructure, you could wait a very long time before sending a `SIGKILL`! (Perhaps even the entire `max_run_time`, at which point you'll know for sure that all jobs have either completed or timed-out-with-cleanup.)
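That SIGTERM-then-wait-then-SIGKILL sequence might be sketched like this (a simplified illustration; in practice your orchestrator, e.g. `docker stop` with a grace period, does this for you):

```ruby
# Send SIGTERM, wait up to grace_seconds for the worker to exit on its
# own, and only then escalate to SIGKILL.
def stop_worker(pid, grace_seconds: 1200)
  Process.kill('TERM', pid)
  deadline = Time.now + grace_seconds
  loop do
    # WNOHANG: returns the pid if the child has exited, nil otherwise.
    return :graceful if Process.waitpid(pid, Process::WNOHANG)

    if Time.now >= deadline
      Process.kill('KILL', pid) # last resort: no cleanup will run
      Process.wait(pid)
      return :forced
    end
    sleep 0.5
  end
end
```

Setting `grace_seconds` as high as `max_run_time` guarantees every job either finished or already hit its timeout before the hard kill.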
> Should a job implement its own traps or is there a hook that Delayed offers?
I hadn't really considered this before. Generally, I think we've found that keeping `max_run_time` configured to the default of 20 minutes (or less), and waiting after `SIGTERM` for jobs to finish gracefully, has produced the best overall results (in addition to making sure that jobs are all idempotent & re-runnable). Even if a worker can't exit gracefully due to a long-running job, the longest we'd have to wait for that job to be picked up again is 20 minutes (and since this typically only affects jobs that take a long time to complete, we don't expect fast turnaround anyway).
Thanks for the explanation @smudge, it all makes sense.
You're correct that the culprit for us is a long-running job, that it should be broken down into shorter units of work, and that we should keep `max_run_time` sensible. We had been delaying that work and kept bumping the `max_run_time` up to compensate :fearful: and given that we're cleaning up containers 60 seconds after a deploy, those jobs are staying locked/stalled for up to 3 hours before they get cleared and picked up again.
While we do have the option to `trap` the `SIGTERM` in the job itself, it would feel cleaner if that came from the Job API (e.g. perhaps `rescue_from`), but I find myself agreeing with you again here and want to avoid that complexity.
tl;dr - We'll decompose our long-running job so that it runs in multiple units, allowing us to keep `max_run_time` low enough for containers to be cleaned up in a timely manner. Thanks!
Our jobs are remaining locked after the worker process receives the `SIGKILL` (in our case from `docker stop` after the grace period elapses). I'm not sure if this is by design or not; the README states ("may" being the keyword):
In my tests they always remain locked when a job is running, and Delayed also doesn't terminate with a `SIGTERM`. Here is a simple rake task I used to simulate the issue from Delayed:
```ruby
class TestWorker
  def start
    trap('TERM') { quit! }
    trap('INT') { quit! }
    500.times do |i|
      puts "Run #{i}"
      sleep 1.second
    end
  ensure
    on_exit!
  end

  def quit!
    puts 'quit!'
  end

  def on_exit!
    puts 'on_exit!'
  end
end

namespace :test do
  desc "Test signal interrupts"
  task work: :environment do
    TestWorker.new.start
  end
end
```

If you run that as a rake task, it won't terminate on a `SIGTERM`, only a `SIGKILL`, which won't execute the `ensure` block -- which in Delayed is what unlocks the jobs. Whereas if I explicitly `exit` in the `quit!` method, it works as expected.
```ruby
class TestWorker
  def start
    trap('TERM') { quit! }
    trap('INT') { quit! }
    500.times do |i|
      puts "Run #{i}"
      sleep 1.second
    end
  ensure
    on_exit!
  end

  def quit!
    puts 'quit!'
    exit
  end

  def on_exit!
    puts 'on_exit!'
  end
end

namespace :test do
  desc "Test signal interrupts"
  task work: :environment do
    TestWorker.new.start
  end
end
```

What's the recommended way to gracefully shut down a running job when the process is terminated? Should a job implement its own traps, or is there a hook that Delayed offers?