contribsys / faktory_worker_go

Faktory workers for Go
Mozilla Public License 2.0

Implement hard shutdown #76

Closed (mperham closed 9 months ago)

mperham commented 9 months ago

FWR (faktory_worker_ruby) implements a hard shutdown timeout of 25 seconds, exactly like Sidekiq, but FWG (faktory_worker_go) never implemented it because contexts were not part of the original API design. A year or two later we integrated Contexts, but only to provide access to helper Values.

Today FWG will pause forever, waiting for all jobs to stop. Most jobs execute quickly and finish within 30 seconds but longer jobs can delay shutdown, leading to platforms like Heroku KILLing FWG without warning, orphaning those long jobs for the reservation timeout (default: 30 minutes).

The proper and canonical shutdown process should:

  1. receive an external signal
  2. flag that shutdown is starting
  3. have internal goroutines exit any process loop/select
  4. run any registered Shutdown callbacks
  5. pause up to ShutdownTimeout for existing jobs to finish
  6. cancel job context so lingering jobs quickly die with an error
  7. FAIL any jobs still running after (5) so they can be re-executed soon
  8. close the connection pool and exit

Ideally this will guarantee that FWG can gracefully stop the vast majority of jobs and FAIL any which linger past 25 seconds so they can be re-executed quickly after process restart.

This is the fix for contribsys/faktory#468

mperham commented 9 months ago

Please note: for FWG to FAIL a job, it must return an error. If your job responds to a cancelled Context by simply returning, that is considered a successful job execution.