aws / aws-sdk-rails

Official repository for the aws-sdk-rails gem, which integrates the AWS SDK for Ruby with Ruby on Rails.
Other
596 stars 62 forks source link

SQS ActiveJobs are retried if uncaught exceptions are raised or retry attempts are exceeded #114

Closed bananastalktome closed 8 months ago

bananastalktome commented 8 months ago

https://guides.rubyonrails.org/active_job_basics.html#retrying-or-discarding-failed-jobs notes that failed jobs are not retried unless the jobs are configured otherwise, however SQS backed ActiveJobs that raise exceptions which are not explicitly discarded or retried continue to be run. This occurs with either the amazon_sqs or amazon_sqs_async adapters. For example, a simple job such as:

class SampleJob < ActiveJob::Base
  queue_as :default

  def perform
    raise Exception, "testing"
  end
end

never gets deleted from the queue, and will be fetched and run again after the messages visibility_timeout expires. I am not sure if this is desired behavior or not, but it caught me off guard as it is different than the above note in the ActiveJob rails docs. A similar test with the Resque gem (being the only other queueing service I have familiarity with and an active project using) removed the job after a single failed run.

If this is intended behavior of the Rails SQS ActiveJob backend, a note in the Readme that the behavior differs from the rails noted behavior would be helpful.

Versions: Rails 7.1.3 Ruby 3.2.2 aws-sdk-rails 3.10.0 SQS standard queue (not FIFO) OSX Ventura 13.6.4

edit: It looks like even if a retry_on Exception, wait: 40.seconds, attempts: 2 is added to the job, it's not removed after 2 attempts as would be expected. Are retries with limited attempts not supported with the SQS backend?

alextwoods commented 8 months ago

Yes - you are correct that Exceptions currently are being left on the queue (and therefore retried, based on the SQS queues policies). The original intention of leaving messages on the queue here was to ensure any generic/transient errors in the executor/message/job runner were retired. However, you are right that this ends up creating a conflicting and confusing experience - especially when you do configure a retry_on - which will result in the job being retried more than the intended number of times. In the example you gave with attempts: 2, active job will retry the job twice by deleting it from the queue and queueing another, new SQS message. Then once it reaches its max attempts, the exception bubbles up and is caught by the exception handler and left on the queue to be re-tried again and will continue until it reaches the queue's configured max reads (and then should be moved to a DLQ).

I do believe we should fix this behavior to align with the rails ActiveJob documentation and document the behavior and its interaction with SQS retry mechanisms in the readme.