Aws::Rails::SqsActiveJob#execute captures StandardError which is too wide

aws / aws-sdk-rails

Official repository for the aws-sdk-rails gem, which integrates the AWS SDK for Ruby with Ruby on Rails.

Other

584 stars 61 forks source link

Aws::Rails::SqsActiveJob#execute captures StandardError which is too wide #120

Open jeropaul opened 1 month ago

jeropaul commented 1 month ago

Steps to reproduce

Set up AWS RDS to manage master passwords and rotate
Configure your database details via environment properties
Wait for your DB password to change

Once step 3 occurs my ActiveRecord jobs start throwing ActiveRecord::NoDatabaseError which is a subclass of StandardError

The current behaviour of Aws::Rails::SqsActiveJob#execute captures StandardError.

Expected behavior

1) I would expect an ActiveRecord::NoDatabaseError NOT to be caught. 2) I would expect the poller to die.

In my case the poller process finishing would kill the container which would be rescheduled with refreshed environment properties.

Actual behavior

The poller continues consuming from the queue, retrying messages at a great rate of speed.

System configuration

Rails version: 7.1

Ruby version: 3.3

jterapin commented 1 month ago

Thank you for the submitting this issue! We will investigate and get back to you.

alextwoods commented 1 month ago

This is a bit tricky - the logic for retrying StandardErrors is that arbitrary user code may throw transient StandardErrors which generally should be retried.

You can disable this behavior by configuring standard_error_retry to false (this logic was added recently in #115). However, this discards jobs rather than causing the entire poller to fail and stop running - in general, the poller was meant to be robust to errors in jobs. However, this seems like a special case where we want to detect such errors and actually kill the poller. I think we would need to come up with a list of these exceptions. ActiveRecordError - the superclass of NoDatabaseError - is likely too broad.

jeropaul commented 1 month ago

I don't expect that it will be possible to come up with a full and complete list of error classes that should be ignored.

Personally I'd be happy if I could configure a list of error classes that should be re-raised by the executor with some really tight default that I can expand over time.

In terms of prior art and it is a slightly different use case but off the top of my head NewRelic allows error classes to be ignored via configuration

Which for one of my apps looks like:

  error_collector:
    ignore_errors: "ActionController::RoutingError,Sinatra::NotFound,ActionDispatch::Http::MimeNegotiation::InvalidType"

TLDR:

continue to catch StandardError
re-raise if the class is included in a configured list

alextwoods commented 1 month ago

I've been doing some more investigation and thinking on this. I agree we can't really come up with a complete or useful list of error classes that should be considered fatal. (Side note: fatal may be a better description than ignored, since its some what the opposite. Those errors should cause the poller to raise the error and cause the worker to exit rather than being ignored).

I think rather that configuring a list, we should follow the user configured error handlers. That error handler could choose to re-raise specific exceptions, causing the poller to fail and exit.

I've marked this as a feature request and we'll create an internal story to track it. Or alternatively, we're happy to accept PRs!

jeropaul commented 1 month ago

we should follow the user configured error handlers.

Can you elaborate further in regards to what this solution looks like?

jeropaul commented 1 month ago

Are you referring to https://guides.rubyonrails.org/active_job_basics.html#exceptions where each job is expected to handle errors and anything not handled by a job causes the poller to fail.

alextwoods commented 1 month ago

No - when a job has un-handled errors, the job should fail but should not cause the entire poller to fail.

What I was thinking is that the executor's error handler could be configured by a user, similar to what is supported in Shoryuken's Exception handler.

What I'm thinking:

# interface is a callable with 3 args: exception, the sqs message (so that delete could be called) and the job
my_handler = proc do |exception, message, job|
  if should_terminate_poller?(exception) # some custom logic that inspects the exception
    raise exception # re-raise the exception, causing the poller to terminate
  end
  # some other custom handling
end
Aws::Rails::SqsActiveJob.configure do |config|
  config.exception_handlers = [my_handler, DefaultErrorHandler]
end

jeropaul commented 3 weeks ago

This is looking to be a bigger change than I was expecting!

The current behaviour of Concurrent::RubyThreadPoolExecutor which is the concrete implementation of Concurrent::ThreadPoolExecutor used by Aws::Rails::SqsActiveJob::Executor does not propagate exceptions. Even when Thread.abort_on_exception=true is configured. There are some details and work arounds discussed in https://github.com/ruby-concurrency/concurrent-ruby/issues/634 which has been open since 2017.

But the short of it is the current implementation provides no way for an exception raised when processing a message to terminate the polling process!