envato / event_sourcery

A library for building event sourced applications in Ruby
MIT License

Custom error handler to catch when event processors throw exceptions #95

Closed mjward closed 7 years ago

mjward commented 7 years ago

Currently, the default behaviour for event processors using the ESPRunner is to retry on exception. It prints the error message to the EventSourcery log (which, if configured, spits it out to your application log), however it doesn't allow you to get a handle on that error, e.g. to report to Rollbar or raise an alarm.

This change allows a proc to be passed into EventSourcery.config (similar to on_unknown_event) which is called when an exception is raised from within a processor.

E.g.

EventSourcery.configure do |config|
  # ...
  config.on_event_processor_error = proc { |exception, processor_name|
    Payables::ErrorReporter.report(exception, processor_name: processor_name)
  }
  # ...
end

IMPORTANT THINGS TO NOTE: If you have stop_on_failure turned off (the default) and configure Rollbar to report the exception, the processor will keep retrying forever and will continually report the same exception, which can hit our account limit.

One processor stuck in this loop, reporting every second, will report ~86k exceptions per day. Now imagine if, for whatever reason, 20 processors in your app were unable to obtain a database connection... it can blow out very quickly.

In this case, you may want to enable stop_on_failure and implement retryable around the logic that you want retried in your processor, rather than relying on the overarching retry mechanism.
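
For example, something like this (a rough sketch only: Retryable comes from the third-party retryable gem, PaymentGateway and PaymentAuthorised are made-up application code, and the reactor DSL follows the README, so module names may differ between versions):

require 'retryable'

class PaymentSubmissionReactor
  include EventSourcery::Postgres::Reactor   # module name may vary by EventSourcery version

  processor_name :payment_submission_reactor

  process PaymentAuthorised do |event|
    # Retry only the call we know is transient and safe to repeat, with a small cap,
    # instead of relying on ESPRunner retrying the whole processor forever.
    Retryable.retryable(tries: 3, sleep: ->(n) { 2**n }, on: [Timeout::Error, Errno::ECONNRESET]) do
      PaymentGateway.submit(event.aggregate_id, event.body.fetch('amount'))
    end
  end
end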

An alternative that has been suggested a few times is to introduce an exponential retry backoff, which eventually kills the process, as part of EventSourcery.

CC/ @orien @stevehodgkiss @andrewgr @stevend @Domon

andrewgr commented 7 years ago

Thanks for adding this @mjward

stevehodgkiss commented 7 years ago

One processor stuck in this loop, reporting every second, will report ~86k exceptions per day. Now imagine if, for whatever reason, 20 processors in your app were unable to obtain a database connection... it can blow out very quickly.

What happens in Identity is the processor crashes and sends a Rollbar error, then it falls behind in the event stream and we get an alert about that after a few minutes (if we didn't notice the new Rollbar error in Slack). ECS continually restarts the processor, like ESPRunner, but it takes maybe 10-15 seconds until it starts up again, so it's not quite as fast. We then manually stop that specific processor in ECS, address the issue, and deploy/start it again. There's no way it would keep bouncing like this for a whole day without somebody being alerted by the Rollbar/Slack noise or the PagerDuty alert.

An alternative that has been suggested a few times is to introduce an exponential retry backoff, which eventually kills the process, as part of EventSourcery.

I think this is a reasonable approach.

Also worth noting that stop on error will bring down a processor if any transient error occurs (unless explicitly handled) and require attention from a developer to restart it.

andrewgr commented 7 years ago

As for stopping on errors, Hosted Service processors stop on unhandled errors to avoid producing incorrect output and to help detect unexpected input data. IMO stop_on_failure should be true by default for those two reasons.

I asked @notahat yesterday about their approach to handling unexpected errors in their non-event_sourcery based system and he said they also stop processing events if an error occurs. It looks like @mjward would also prefer to stop processing on an unexpected error. Shall we maybe set stop_on_failure to true by default since stopping processing on error is a relatively common pattern?

IMO, using retry is a good idea to handle cases like temporarily unreachable API endpoints. But perhaps retry or retryable should be explicitly used in application code rather than in the framework code?

mjward commented 7 years ago

There's no way it would keep bouncing like this for a whole day without somebody being alerted by the Rollbar/Slack noise or the PagerDuty alert.

It really depends on how Rollbar is configured and whether Rollbar is even reporting to Slack or raising alarms as to how long it could go unnoticed. Stuff can break on weekends too. It's happened in the past that rogue exceptions were continually thrown to Rollbar and made us reach the limit for the company-wide account.

@andrewgr yes, we conducted some analysis last week into the places and conditions in which reactors specifically could fail, and questioned whether we wanted the process to keep working if it got into a weird state. The consensus we arrived at was that we try to make reactors as resilient as possible, so if a reactor were to throw an exception, it would be very much unexpected. Allowing it to continue "reacting" with a potentially corrupt state of the world felt dangerous and very unpredictable.

I think I agree that the retry mechanism should live in the application code, where it is more visible to the thing that cares about it most, and not buried deep down in the framework. Retry rules feel very specific to the application logic rather than the framework. I think it is nice that the retry is there in EventSourcery; I just think it's a little heavy-handed the way it continually keeps retrying forever with the same inputs expecting a different result. For transient errors it may be useful.

stevehodgkiss commented 7 years ago

It really depends on how Rollbar is configured and whether Rollbar is even reporting to Slack as to how long it could go unnoticed.

We have alerts on ESP lag. If an error is persistent we'll get an alert within 5 minutes, regardless of anything to do with Rollbar. We report ESP lag to CloudWatch and have an alarm based on it. If an ESP gets stuck on an event it will alert us, because it will be behind where it should be.

As for stopping on errors, Hosted Service processors stop on unhandled errors to avoid producing incorrect output and to help detect unexpected input data. IMO stop_on_failure should be true by default for those two reasons.

@andrewgr What do you mean by "produce incorrect output"? Is it not safe to retry processing an event? @mjward in what scenario could state be corrupted? Do you have an example?

I don't think stop on failure should be the default, because that will make transient errors kill a processor. I think an exponential backoff and eventual stop, like DJ/Sidekiq, is the best choice here. It makes sense for the retry logic to live within the framework. Both DJ and Sidekiq have retry options built in, and they're quite useful for the reasons I mentioned above (transient failures).
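
As a rough illustration only (this is not EventSourcery's current API; run_with_backoff, the attempt cap and the sleep cap are made-up placeholders, and the error handler call assumes the hook added in this PR):

# Hypothetical runner loop: exponential backoff with an eventual stop,
# roughly in the spirit of DJ/Sidekiq retry handling.
def run_with_backoff(processor_name, max_attempts: 10)
  attempts = 0
  begin
    yield                                              # run the processor's subscribe loop
  rescue StandardError => error
    attempts += 1
    EventSourcery.config.on_event_processor_error.call(error, processor_name)
    raise if attempts >= max_attempts                  # eventually give up and kill the process
    sleep([2**attempts, 300].min)                      # 2s, 4s, 8s... capped at 5 minutes
    retry
  end
end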

I think I agree that the retry mechanism should live in the application code, where it is more visible to the thing that cares about it most, and not buried deep down in the framework.

I think it should live within the framework, with the option to use it or not within the app.

mjward commented 7 years ago

Let me rephrase that: I'm not saying EventSourcery shouldn't have a retry mechanism. I have an issue with it retrying forever, but I agree with you that the backoff would benefit transient errors and be a good fallback.

My thinking was more around allowing that logic to surface in our app and making that retry explicit, rather than relying on the framework to do it for us. I.e. if you're going to retry submission to a payment gateway, I would like that to be very clear from whatever is handling that within the app, rather than it indirectly happening deep down in the framework.

As for some further explanation re: corrupted data...

A common pattern that we all use is that if a Reactor needs to know about something, it subscribes to various events to build up internal state. This state is crucial, as the Reactor relies on it to determine how it "reacts" to other events. On the happy path, we make the assumption that the Reactor will have everything that it needs, but what if it doesn't? If it looks for a row inside an internal projection but it's not there, or is incomplete, what is it to do? It has two choices: ignore it and carry on, or stop.

If we were to carry on and ignore it, it's not very clear what the outcome may be. Questions start arising and much head scratching follows. How do we replay whatever has been missed considering the event tracker number has proceeded? How does this leave the lifecycle of that aggregate? How does this impact reporting? Are other events chained that depend on the outcome of the thing occurring? Will more events be emitted with an incorrect representation of the current state of the world? How do we clean up incorrect state/projections? How difficult is this to debug?

If we stop, we have the opportunity to inspect and make a determination and therefore prevent the system from leaving things in an inconsistent or awkward state.

We think the safer option (especially considering we are dealing with money and financial reporting) is to stop immediately.
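
To make that a little more concrete, a hypothetical sketch (the event classes, table and amounts are made up; DB stands in for however the reactor stores its internal projection, e.g. a Sequel dataset, and the reactor DSL follows the README so module names may differ by version):

class RefundReactor
  include EventSourcery::Postgres::Reactor   # module name may vary by EventSourcery version

  processor_name :refund_reactor

  process PaymentTaken do |event|
    # Build up internal state from earlier events.
    DB[:payments_state].insert(aggregate_id: event.aggregate_id,
                               amount: event.body.fetch('amount'))
  end

  process RefundRequested do |event|
    payment = DB[:payments_state].where(aggregate_id: event.aggregate_id).first

    # The unhappy path: the row the reactor relies on isn't there. Carrying on would
    # mean reacting to an incomplete view of the world, so the safer choice here is to
    # raise and stop for inspection (with stop_on_failure enabled).
    raise "missing payment state for #{event.aggregate_id}" if payment.nil?

    # ...emit the follow-up refund event based on the internal state
  end
end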

andrewgr commented 7 years ago

Thanks @mjward.

@stevehodgkiss by "incorrect output" I meant what Matt called "corrupted data" and explained in the outcomes of the "ignore it and carry on" approach.

stevehodgkiss commented 7 years ago

My thinking was more around allowing that logic to surface in our app and making that retry explicit, rather than relying on the framework to do it for us. I.e. if you're going to retry submission to a payment gateway, I would like that to be very clear from whatever is handling that within the app, rather than it indirectly happening deep down in the framework.

Yeah, it sounds like the best option would be to be able to control the retry behaviour per ESP, much like you can with Sidekiq. That way you could disable it where deemed necessary, but enable it in projections, where everything is in a DB transaction, side-effect free (or should be), and safe to retry.
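
Purely as a thought experiment, something along these lines (retry_options is a made-up DSL modelled on Sidekiq's sidekiq_options, not something EventSourcery provides):

class AccountBalanceProjector
  include EventSourcery::Postgres::Projector
  projector_name :account_balance_projector
  retry_options attempts: 10, backoff: :exponential   # hypothetical: safe to retry, runs in a DB transaction
end

class PaymentSubmissionReactor
  include EventSourcery::Postgres::Reactor
  processor_name :payment_submission_reactor
  retry_options attempts: 0                            # hypothetical: too risky to retry, stop immediately
end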

On the happy path, we make the assumption that the Reactor will have everything that it needs, but what if it doesn't? If it looks for a row inside an internal projection but it's not there, or is incomplete, what is it to do? It has two choices...

I'm curious about how this scenario above would occur. A concrete example would be useful. Reactors aren't wrapped in DB transactions, so I can imagine cases where there would be potentially duplicated rows, or twice-updated rows, depending on how the reactor interacts with the table.
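
One contrived illustration of that duplication risk (the event, table and Mailer are hypothetical, and the upsert uses Sequel's insert_conflict, so it assumes Postgres):

class InvoiceNotifier
  include EventSourcery::Postgres::Reactor   # module name may vary by EventSourcery version

  processor_name :invoice_notifier

  process InvoiceRaised do |event|
    # Not wrapped in a transaction: if the mailer call below raises and the event is
    # retried, a plain insert here would leave a duplicate row. Keying the write on the
    # event id and upserting keeps the handler safe to retry.
    DB[:invoice_notifications]
      .insert_conflict(target: :event_id)
      .insert(event_id: event.id, invoice_id: event.aggregate_id, sent_at: Time.now)

    Mailer.deliver_invoice(event.aggregate_id)   # hypothetical side effect that may fail
  end
end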

How do we replay whatever has been missed considering the event tracker number has proceeded?

How would the event be processed and then have corrupted state? A concrete example would be useful here too :)

We think the safer option (especially considering we are dealing with money and financial reporting) is to stop immediately.

Fair call, sounds reasonable. If we had retry options available to ESPs, then you could just disable retrying for reactors or other things that are too risky to retry.

mjward commented 7 years ago

How'd I know you were going to come back at me with a request for a more concrete example, lol. Leave it with me and I'll formulate a couple; it's often difficult to relay complex domain problems succinctly over text.

mjward commented 7 years ago

But you essentially hit the nail on the head with

Reactors aren't wrapped in DB transactions, so I can imagine cases where there would be potentially duplicated rows, or twice-updated rows, depending on how the reactor interacts with the table.

If a reactor is replaying an event it already knows about, that is something we want to know. We had the issue the other week where, during a migration, we accidentally restored the event_store DB before the projections DB (whilst the app was running). Fun times, and it started to raise a lot of questions about what-ifs.

But even a simple bug via a bad commit could knock stuff out and the consequences are not clear. It very much depends on how that state is being used.