department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Sidekiq Dead Queue Analysis #70579

Open LindseySaari opened 7 months ago

LindseySaari commented 7 months ago

Description

During the COLA maintenance, we observed a high number of job retries. This prompted us to re-evaluate our approach to automated retries and the visibility of dead jobs in Sidekiq. Currently, our main source of insight into dead jobs is the Sidekiq UI. By default, Sidekiq retries a failed job 25 times with exponential backoff before moving it to the dead queue. Once there, a job stays inactive unless someone retries it manually. We should consider alerting when jobs enter the dead queue, or creating a Datadog dashboard for better monitoring.
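
To give a sense of how long a job keeps retrying before it reaches the dead queue, here is a rough sketch of the backoff arithmetic. It assumes the default delay documented by Sidekiq, roughly `(count ** 4) + 15` seconds plus random jitter, and ignores the jitter; the exact formula may differ between Sidekiq versions, and the method names are illustrative only.

```ruby
# Base delay in seconds before retry number `count` (0-based).
# Assumption: Sidekiq's documented default of (count ** 4) + 15, with jitter left out.
def base_retry_delay(count)
  (count**4) + 15
end

# Minimum total seconds spent waiting across all retries before the job is dead.
def minimum_seconds_until_dead(max_retries = 25)
  (0...max_retries).sum { |count| base_retry_delay(count) }
end

seconds = minimum_seconds_until_dead
days = (seconds / 86_400.0).round(1)
puts "A job that always fails is dead after at least #{seconds}s (~#{days} days)"
```

Under these assumptions a job that fails every time takes about 20 days to land in the dead queue, so a dead-queue alert fires long after the first failure.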

rmtolmach commented 7 months ago

Notes from CoP meeting:

Example of why dead jobs matter: a Veteran submits a form and the submission fails. The job retries 25 times and then becomes a dead Sidekiq job. That form will not be submitted unless someone retries the job manually.

Do NOT include the contents of the Arguments field in any alert or log message, since it can contain PII.
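
One way to follow that rule is to build the alert payload from an allowlist of job fields rather than forwarding the whole job hash. The sketch below is only an illustration: `SAFE_JOB_KEYS`, `dead_job_alert_payload`, and `FormSubmissionJob` are hypothetical names, and leaving out the error message (which could echo user input) is an extra precaution assumed here, not something decided in the meeting.

```ruby
# Job fields considered safe to forward. 'args' is deliberately absent
# because job arguments can contain PII.
SAFE_JOB_KEYS = %w[jid class queue retry_count].freeze

# Builds a PII-free summary of a failed job for logs, Slack, or Datadog.
def dead_job_alert_payload(job, error)
  job.slice(*SAFE_JOB_KEYS).merge('error_class' => error.class.name)
end

# Example job hash shaped like the one Sidekiq passes to middleware.
job = {
  'jid' => 'abc123',
  'class' => 'FormSubmissionJob', # hypothetical job class
  'queue' => 'default',
  'retry_count' => 24,
  'args' => ['sensitive form data']
}

puts dead_job_alert_payload(job, RuntimeError.new('upstream timeout'))
```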

Something like this could be done:

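# Server middleware that logs when a failing job has exhausted its retries.
class SidekiqDeadQueueRescue
  DEFAULT_MAX_RETRIES = 25 # Sidekiq's default when `retry` is true

  def call(_worker, job, _queue)
    yield
  rescue StandardError => e
    if retries_exhausted?(job)
      # Log or take action here. Never include job['args'], which can contain PII.
      Rails.logger.error("Job #{job['jid']} is going to the dead queue: #{e.message}") # OR SEND A SLACK MESSAGE
      # Optionally, send metrics or alerts to Datadog here
    end

    # Re-raise the exception to let Sidekiq handle it
    raise
  end

  private

  def retries_exhausted?(job)
    max = job['retry'] # true, false, or an Integer
    return false if max == false # jobs with retry: false are discarded, not sent to the dead queue

    max = DEFAULT_MAX_RETRIES unless max.is_a?(Integer)
    # retry_count is absent on the first run and 0 on the first retry
    next_count = job['retry_count'] ? job['retry_count'].to_i + 1 : 0
    next_count >= max
  end
end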

require 'sidekiq_dead_queue_rescue'

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqDeadQueueRescue
  end
end
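
An alternative to custom middleware may be Sidekiq's own death handler hook (`config.death_handlers`, available since Sidekiq 5.1), which Sidekiq calls with the job hash and the exception once retries are exhausted. The sketch below is a minimal illustration, not a recommended final design: `build_dead_job_handler` is a hypothetical helper, and the logger would likely be `Rails.logger` in the real app.

```ruby
require 'logger'

# Returns a lambda suitable for Sidekiq's death_handlers list.
# Only the JID, job class, and error class are logged; job['args'] is
# never touched because it can contain PII.
def build_dead_job_handler(logger)
  lambda do |job, exception|
    logger.error(
      "Sidekiq job #{job['jid']} (#{job['class']}) moved to the dead queue: #{exception.class}"
    )
  end
end

# Register the handler only when Sidekiq is loaded, so this file also
# runs standalone.
if defined?(Sidekiq)
  Sidekiq.configure_server do |config|
    config.death_handlers << build_dead_job_handler(Logger.new($stdout))
  end
end
```

Because Sidekiq decides when a job is dead, this approach avoids re-implementing the retry-count comparison in middleware.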