Closed bananastalktome closed 8 months ago
Yes - you are correct that Exceptions currently are being left on the queue (and therefore retried, based on the SQS queues policies). The original intention of leaving messages on the queue here was to ensure any generic/transient errors in the executor/message/job runner were retired. However, you are right that this ends up creating a conflicting and confusing experience - especially when you do configure a retry_on
- which will result in the job being retried more than the intended number of times. In the example you gave with attempts: 2
, active job will retry the job twice by deleting it from the queue and queueing another, new SQS message. Then once it reaches its max attempts, the exception bubbles up and is caught by the exception handler and left on the queue to be re-tried again and will continue until it reaches the queue's configured max reads (and then should be moved to a DLQ).
I do believe we should fix this behavior to align with the rails ActiveJob documentation and document the behavior and its interaction with SQS retry mechanisms in the readme.
https://guides.rubyonrails.org/active_job_basics.html#retrying-or-discarding-failed-jobs notes that failed jobs are not retried unless the jobs are configured otherwise, however SQS backed ActiveJobs that raise exceptions which are not explicitly discarded or retried continue to be run. This occurs with either the
amazon_sqs
oramazon_sqs_async
adapters. For example, a simple job such as:never gets deleted from the queue, and will be fetched and run again after the messages visibility_timeout expires. I am not sure if this is desired behavior or not, but it caught me off guard as it is different than the above note in the ActiveJob rails docs. A similar test with the Resque gem (being the only other queueing service I have familiarity with and an active project using) removed the job after a single failed run.
If this is intended behavior of the Rails SQS ActiveJob backend, a note in the Readme that the behavior differs from the rails noted behavior would be helpful.
Versions: Rails 7.1.3 Ruby 3.2.2 aws-sdk-rails 3.10.0 SQS standard queue (not FIFO) OSX Ventura 13.6.4
edit: It looks like even if a
retry_on Exception, wait: 40.seconds, attempts: 2
is added to the job, it's not removed after 2 attempts as would be expected. Are retries with limited attempts not supported with the SQS backend?