Open jeropaul opened 1 month ago
Thank you for the submitting this issue! We will investigate and get back to you.
This is a bit tricky - the logic for retrying StandardErrors
is that arbitrary user code may throw transient StandardErrors
which generally should be retried.
You can disable this behavior by configuring standard_error_retry
to false (this logic was added recently in #115). However, this discards jobs rather than causing the entire poller to fail and stop running - in general, the poller was meant to be robust to errors in jobs. However, this seems like a special case where we want to detect such errors and actually kill the poller. I think we would need to come up with a list of these exceptions. ActiveRecordError - the superclass of NoDatabaseError - is likely too broad.
I don't expect that it will be possible to come up with a full and complete list of error classes that should be ignored.
Personally I'd be happy if I could configure a list of error classes that should be re-raised by the executor with some really tight default that I can expand over time.
In terms of prior art and it is a slightly different use case but off the top of my head NewRelic allows error classes to be ignored via configuration
Which for one of my apps looks like:
error_collector:
ignore_errors: "ActionController::RoutingError,Sinatra::NotFound,ActionDispatch::Http::MimeNegotiation::InvalidType"
TLDR:
I've been doing some more investigation and thinking on this. I agree we can't really come up with a complete or useful list of error classes that should be considered fatal. (Side note: fatal may be a better description than ignored, since its some what the opposite. Those errors should cause the poller to raise the error and cause the worker to exit rather than being ignored).
I think rather that configuring a list, we should follow the user configured error handlers. That error handler could choose to re-raise specific exceptions, causing the poller to fail and exit.
I've marked this as a feature request and we'll create an internal story to track it. Or alternatively, we're happy to accept PRs!
we should follow the user configured error handlers.
Can you elaborate further in regards to what this solution looks like?
Are you referring to https://guides.rubyonrails.org/active_job_basics.html#exceptions where each job is expected to handle errors and anything not handled by a job causes the poller to fail.
No - when a job has un-handled errors, the job should fail but should not cause the entire poller to fail.
What I was thinking is that the executor's error handler could be configured by a user, similar to what is supported in Shoryuken's Exception handler.
What I'm thinking:
# interface is a callable with 3 args: exception, the sqs message (so that delete could be called) and the job
my_handler = proc do |exception, message, job|
if should_terminate_poller?(exception) # some custom logic that inspects the exception
raise exception # re-raise the exception, causing the poller to terminate
end
# some other custom handling
end
Aws::Rails::SqsActiveJob.configure do |config|
config.exception_handlers = [my_handler, DefaultErrorHandler]
end
This is looking to be a bigger change than I was expecting!
The current behaviour of Concurrent::RubyThreadPoolExecutor
which is the concrete implementation of Concurrent::ThreadPoolExecutor
used by Aws::Rails::SqsActiveJob::Executor
does not propagate exceptions. Even when Thread.abort_on_exception=true
is configured. There are some details and work arounds discussed in https://github.com/ruby-concurrency/concurrent-ruby/issues/634 which has been open since 2017.
But the short of it is the current implementation provides no way for an exception raised when processing a message to terminate the polling process!
Steps to reproduce
Once step 3 occurs my ActiveRecord jobs start throwing
ActiveRecord::NoDatabaseError
which is a subclass ofStandardError
The current behaviour of Aws::Rails::SqsActiveJob#execute captures
StandardError
.Expected behavior
1) I would expect an
ActiveRecord::NoDatabaseError
NOT to be caught. 2) I would expect the poller to die.In my case the poller process finishing would kill the container which would be rescheduled with refreshed environment properties.
Actual behavior
The poller continues consuming from the queue, retrying messages at a great rate of speed.
System configuration
Rails version: 7.1
Ruby version: 3.3