Open LindseySaari opened 7 months ago
Notes from CoP meeting:
slack-notifier gem? (blog post here)
Example of why dead jobs matter: a veteran submits a form and it fails to submit. It retries 25 times and becomes a dead Sidekiq job. That form will not be submitted.
Keep in mind to NOT send information in the Arguments field, since that can contain PII.
Something like this could be done:
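class SidekiqDeadQueueRescue
  # Sidekiq's default retry limit, used when the `retry` option is `true`
  DEFAULT_MAX_RETRIES = 25

  def call(worker, job, queue)
    yield
  rescue => e
    # `retry` can be true (default limit), false (no retries), or an Integer
    max_retries = worker.class.get_sidekiq_options['retry']
    max_retries = DEFAULT_MAX_RETRIES if max_retries == true
    if max_retries.is_a?(Integer) && job['retry_count'].to_i + 1 >= max_retries
      # Log or take action here (do not include job['args'], which can contain PII)
      Rails.logger.error("Job #{job['jid']} is going to the dead queue: #{e.message}") # OR SEND A SLACK MESSAGE
      # Optionally, send metrics or alerts to Datadog here
    end
    # Re-raise the exception to let Sidekiq handle it
    raise e
  end
end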
require 'sidekiq_dead_queue_rescue'

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqDeadQueueRescue
  end
end
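To make the PII note above concrete, one option is to build the alert text from an allowlist of job fields, so the arguments are never read at all. This is only a sketch: `SAFE_JOB_KEYS` and `dead_job_alert_text` are hypothetical names, and the list of safe fields is an assumption that should be reviewed.

```ruby
# Hypothetical helper: builds alert text from an allowlist of job fields.
# job['args'] is never read, since the arguments can contain PII.
SAFE_JOB_KEYS = %w[jid class queue retry_count].freeze

def dead_job_alert_text(job, error)
  fields = SAFE_JOB_KEYS.map { |key| "#{key}=#{job[key]}" }.join(' ')
  # Report only the error class; error messages may echo user input.
  "Sidekiq job moved to dead queue: #{fields} error=#{error.class}"
end
```

The middleware above could pass this text to the logger or to Slack instead of interpolating `e.message` directly.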
Description
During the COLA maintenance, we observed a high number of job retries. This led us to evaluate our approach to automated retries and consider the visibility of dead jobs in Sidekiq. Currently, our main source of insight into these dead jobs is the Sidekiq UI. Sidekiq jobs typically retry 25 times with exponential backoff before being moved to the dead queue. Once in the dead queue, these jobs remain inactive unless manually retried. We should think through implementing alerts for when jobs enter the dead queue, or possibly creating a Datadog dashboard for better monitoring.
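As one possible starting point for the alerting idea, Sidekiq provides `death_handlers`, which are called with the job hash and the exception when a job moves to the dead set. The sketch below is not a decided approach: `DeadJobAlert` is a hypothetical class, and the `notifier` is assumed to be any object that responds to `ping(text)` (for example, a slack-notifier client). The job arguments are deliberately left out of the message.

```ruby
require 'logger'

# Hypothetical death handler: logs and optionally notifies when a job dies.
class DeadJobAlert
  # notifier: assumed to respond to #ping(text); nil disables notification
  def initialize(logger: Logger.new($stdout), notifier: nil)
    @logger = logger
    @notifier = notifier
  end

  # Sidekiq calls death handlers with (job, exception)
  def call(job, exception)
    # Only non-sensitive fields are used; job['args'] can contain PII.
    text = "Sidekiq job #{job['class']} (jid #{job['jid']}) " \
           "moved to the dead queue after #{exception.class}"
    @logger.error(text)
    @notifier&.ping(text)
  end
end

# Register the handler only when Sidekiq is loaded
if defined?(Sidekiq)
  Sidekiq.configure_server do |config|
    config.death_handlers << DeadJobAlert.new
  end
end
```

Compared with a server middleware, a death handler does not need to work out the retry limit itself, since Sidekiq only invokes it once the job has actually died.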