bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/

How to properly handle interrupts #1343

Closed jgrau closed 2 months ago

jgrau commented 2 months ago

Hi @bensheldon

I am using Sentry Cron Monitoring and have something that roughly translates to

    around_perform do |_job, block|
      # slug and monitor configs are irrelevant for this example but they exist :)
      check_in_id = Sentry.capture_check_in(slug, :in_progress, monitor_config:)

      start = Sentry.utc_now.to_i

      begin
        block.call
        duration = Sentry.utc_now.to_i - start

        Sentry.capture_check_in(
          slug,
          :ok,
          check_in_id:,
          duration:,
          monitor_config:,
        )
      rescue StandardError
        duration = Sentry.utc_now.to_i - start

        Sentry.capture_check_in(
          slug,
          :error,
          check_in_id:,
          duration:,
          monitor_config:,
        )

        raise
      end
    end

in application_job.rb.

I have a problem when a job is interrupted while performing. The job has successfully "checked in", but since it is interrupted it never gets the chance to update the check-in with either :ok or :error. Meanwhile, the second execution of the job (the automatic retry) performs another check-in and probably also a correct :ok update, but as far as Sentry knows, the original execution never finishes and receives a "timeout".

I figured I could solve that with something along the lines of

    # Use this to control interrupts
    # https://github.com/bensheldon/good_job#Interrupts-graceful-shutdown-and-SIGKILL
    include GoodJob::ActiveJobExtensions::InterruptErrors

    # Retry the job if it is interrupted
    retry_on GoodJob::InterruptError, wait: 0, attempts: :unlimited

    around_perform do |_job, block|
      # slug and monitor configs are irrelevant for this example but they exist :)
      check_in_id = Sentry.capture_check_in(slug, :in_progress, monitor_config:)

      start = Sentry.utc_now.to_i

      begin
        block.call
        duration = Sentry.utc_now.to_i - start

        Sentry.capture_check_in(
          slug,
          :ok,
          check_in_id:,
          duration:,
          monitor_config:,
        )
      rescue GoodJob::InterruptError
        # If a job is interrupted by GoodJob it will be automatically retried,
        # and as it retries it will check in with Sentry again. Without this
        # rescue block, however, the cron check-in of the original job would
        # never receive a status update and would be marked as having exceeded
        # the max runtime (duration error).
        duration = Sentry.utc_now.to_i - start

        Sentry.capture_check_in(slug, :ok, check_in_id:, duration:)

        raise
      rescue StandardError
        duration = Sentry.utc_now.to_i - start

        Sentry.capture_check_in(
          slug,
          :error,
          check_in_id:,
          duration:,
          monitor_config:,
        )

        raise
      end
    end

But as far as I can tell, I don't get to rescue the interrupt error.

Can you tell whether my proposed solution "should work" (I might have made a stupid mistake somewhere) or whether I am going down the wrong path?

bensheldon commented 2 months ago

Hmmm. I just double-checked and it seems like the retry_on handler will be called after the perform callbacks:

https://github.com/rails/rails/blob/4bb73233413f30fd7217bd7f08af44963f5832b1/activejob/lib/active_job/execution.rb#L51-L53

I'm not sure, though, in what order Active Job callbacks run. Could you try defining your around_perform before you include GoodJob::ActiveJobExtensions::InterruptErrors? That itself is an around_perform:

https://github.com/bensheldon/good_job/blob/fbd1a5b730bd735110eb23e7be72f61128f346a2/lib/good_job/active_job_extensions/interrupt_errors.rb#L8-L12

bensheldon commented 2 months ago

Aha, found it:

Calls the before and around callbacks in the order they were set

So if I'm understanding this correctly, because these are wrapping callbacks, you'll want your own custom around_perform wrapper to be defined before you include GoodJob::ActiveJobExtensions::InterruptErrors.

If that works, that might be a nice note to add to the README.
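
To illustrate the ordering, here's a minimal sketch (the class and the rescue body are placeholders, not code from this thread): the first-defined around_perform becomes the outermost wrapper, so it surrounds the one added by the include and can rescue what that one raises.

    class ApplicationJob < ActiveJob::Base
      # Defined first => outermost wrapper. It surrounds the around_perform
      # that GoodJob::ActiveJobExtensions::InterruptErrors adds below, so a
      # GoodJob::InterruptError raised there bubbles through this rescue.
      around_perform do |_job, block|
        block.call
      rescue GoodJob::InterruptError
        # report the interrupted check-in to Sentry here, then re-raise
        raise
      end

      # Included second => its around_perform runs inside the wrapper above.
      include GoodJob::ActiveJobExtensions::InterruptErrors
      retry_on GoodJob::InterruptError, wait: 0, attempts: :unlimited
    end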

jgrau commented 2 months ago

I think the reordering of callbacks does allow me to rescue the interrupt error but it also seems as if it's the second execution that is raising the exception (that I can now rescue). Does that make sense? I think I need to be able to rescue (run something) when the first execution is shutting down... 🤔

bensheldon commented 2 months ago

Oh! Yes, the InterruptError is only raised on the 2nd execution, because the first execution was interrupted (SIGKILL).

You can't run code during a SIGKILL. The best you can do is put your code in an ensure block.

Unless I'm misunderstanding the circumstances under which your job is being interrupted.

jgrau commented 2 months ago

Thank you! I think I have something working:

    around_perform do |_job, block|
      # slug and monitor configs are irrelevant for this example but they exist :)
      check_in_id = Sentry.capture_check_in(slug, :in_progress, monitor_config:)

      start = Sentry.utc_now.to_i
      result = :ok

      begin
        block.call
      rescue StandardError
        result = :error

        raise
      ensure
        duration = Sentry.utc_now.to_i - start

        Sentry.capture_check_in(
          slug,
          result,
          check_in_id:,
          duration:,
          monitor_config:,
        )
      end
    end

This does require setting a shutdown_timeout that is lower than the grace period or timeout of the system that runs the Ruby process. For example, I believe Kubernetes has a default grace period of 30 seconds.

    Rails.application.configure do
      ...
      # Wait 20 seconds for jobs to finish before shutting down. The Kubernetes grace
      # period is 30 seconds, so forcing a shutdown after 20 seconds will allow good_job
      # to handle the shutdown somewhat gracefully.
      config.good_job.shutdown_timeout = 20
    end

I also think my attempts to reproduce production behaviour locally were flawed: I was using foreman, which has a default timeout of 5 seconds before sending SIGKILL, so until I set config.good_job.shutdown_timeout = 1 my ensure block would not run.
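
For local testing, foreman's own timeout is the other knob: if I remember correctly it has a -t/--timeout option (seconds before it sends SIGKILL, default 5), so raising that is an alternative to lowering shutdown_timeout:

    # Assumption: foreman's -t/--timeout flag (seconds before SIGKILL, default 5).
    # Give processes 25 seconds to shut down so the ensure block can run locally.
    foreman start -t 25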

Anyway, I believe this is working. Thanks for a great gem and your quick support :)