Sidekiq Dead Queue Analysis

Notes from CoP meeting:

Who should take care of the dead jobs? No one. They go there to die.
Sidekiq prod UI access is restricted, so some teams might not even know about all the dead jobs.
An alert would notify the team responsible, ideally.
- What is the protocol for this? We can make alerts all day, but who's going to care?
- Bring up with Bill Chapman and the SRE board (#platform-sre-advisory-board)?
Could we create a monitor based on logging from dead jobs in Sidekiq?
Should we clean up the dead jobs first and start with a clean slate?
Can we use the slack-notifier gem? blog post here
Custom sidekiq middleware? + slack-notifier?
slack-notify or slack-notifier gem?
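If the Slack route is chosen, the message itself can be prototyped before the gem question is settled. A minimal sketch, assuming a Slack incoming webhook: it uses Ruby's standard-library Net::HTTP rather than slack-notifier or slack-notify, and the helper names and the `SLACK_WEBHOOK_URL` environment variable are hypothetical.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: builds the POST request for a Slack incoming webhook.
# Incoming webhooks accept a JSON body with a "text" field.
def build_slack_request(webhook_url, text)
  uri = URI.parse(webhook_url)
  request = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(text: text)
  [uri, request]
end

# Hypothetical helper: sends the message and returns true on a 2xx response.
def notify_slack(webhook_url, text)
  uri, request = build_slack_request(webhook_url, text)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess)
end

# Usage (SLACK_WEBHOOK_URL is an assumed environment variable):
#   notify_slack(ENV.fetch('SLACK_WEBHOOK_URL'), 'A job moved to the Sidekiq dead queue')
```

Keeping request construction separate from sending makes the payload easy to check without a network call.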

Example of why dead jobs matter: a veteran submits a form and the submission job fails. Sidekiq retries it 25 times (the default), after which it becomes a dead job. That form will not be submitted.

Do NOT include the contents of a job's Arguments field in any alert or log message, since the arguments can contain PII.
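One way to enforce this is an allowlist, so that only known-safe fields ever reach a log line or Slack message. A minimal sketch: `SAFE_JOB_KEYS` and `pii_safe_job_summary` are hypothetical names, the job class is made up, and the choice of which keys count as safe is an assumption the team would need to confirm.

```ruby
# Keys assumed safe to share; 'args' is deliberately absent because it can hold PII.
# 'error_message' is also left out, since exception messages may echo user input.
SAFE_JOB_KEYS = %w[jid class queue retry_count failed_at error_class].freeze

# Hypothetical helper: returns only the allowlisted fields of a Sidekiq job hash.
def pii_safe_job_summary(job)
  job.slice(*SAFE_JOB_KEYS)
end

job = {
  'jid' => 'abc123',
  'class' => 'ExampleFormSubmissionJob', # hypothetical job class
  'queue' => 'default',
  'args' => ['sensitive form data'],
  'retry_count' => 24,
  'error_message' => 'may echo user input'
}

puts pii_safe_job_summary(job)
```

An allowlist fails closed: a field added to the job payload later stays out of alerts until someone adds it deliberately.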

Something like this could be done:

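class SidekiqDeadQueueRescue
  # Sidekiq's default retry limit when `retry` is true (assumes max_retries is not overridden globally)
  DEFAULT_MAX_RETRIES = 25

  def call(_worker, job, _queue)
    yield
  rescue StandardError => e
    if final_attempt?(job)
      # Log or take action here. Do not include job['args']; they can contain PII.
      Rails.logger.error("Job #{job['jid']} is going to the dead queue: #{e.message}") # OR SEND A SLACK MESSAGE
      # Optionally, send metrics or alerts to Datadog here
    end

    # Re-raise the exception to let Sidekiq handle it
    raise
  end

  private

  # job['retry'] may be true, false, or an Integer; job['retry_count'] is absent on the first attempt.
  def final_attempt?(job)
    # Jobs with retry: false or dead: false never reach the dead queue
    return false if job['retry'] == false || job['dead'] == false

    max_retries = job['retry'].is_a?(Integer) ? job['retry'] : DEFAULT_MAX_RETRIES
    next_count = job['retry_count'] ? job['retry_count'] + 1 : 0
    next_count >= max_retries
  end
end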

require 'sidekiq_dead_queue_rescue'

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqDeadQueueRescue
  end
end
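Sidekiq also exposes `death_handlers`, callbacks that receive the job hash and the exception once a job has run out of retries, which avoids re-deriving the retry count in middleware. A minimal sketch, assuming Sidekiq 5.1 or later: `DEAD_JOB_HANDLER` and the log wording are hypothetical, and the registration is guarded so the snippet also loads where Sidekiq is not present.

```ruby
require 'logger'

# Hypothetical handler. Sidekiq calls each death handler with (job, exception).
# The message leaves out job['args'], which can contain PII.
DEAD_JOB_HANDLER = lambda do |job, exception|
  logger = defined?(Rails) ? Rails.logger : Logger.new($stdout)
  message = "Job #{job['jid']} (#{job['class']}) exhausted its retries: #{exception.class}"
  logger.error(message)
  message
end

# Register the handler only when Sidekiq is loaded.
if defined?(Sidekiq)
  Sidekiq.configure_server do |config|
    config.death_handlers << DEAD_JOB_HANDLER
  end
end
```

Depending on the Sidekiq version, death handlers may also run for jobs that are discarded rather than stored in the dead set (for example with `dead: false`), so the handler should not assume the job is visible in the UI.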

department-of-veterans-affairs / va.gov-team

Sidekiq Dead Queue Analysis #70579
