Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Internal poison message handling #338

Open cgillum opened 6 years ago

cgillum commented 6 years ago

Poison queue messages should be impossible if DurableTask.AzureStorage is implemented correctly, but there is no way to 100% guarantee this. The Azure Storage orchestration service should therefore have some kind of poison message handling to avoid generating unlimited amounts of log churn, storage I/O, and potentially billing charges.

Related issue: https://github.com/Azure/azure-functions-durable-extension/issues/334

Design Proposal

There are three different ways we could handle poison messages:

  1. None: The current behavior, which is to keep retrying indefinitely. This should remain the default for 1.x but not for 2.x. It's problematic because it can add never-ending overhead to the application.
  2. Discard: Drop a message that appears to be a poison message - i.e., one with a DequeueCount of 100 or more (configurable); a rough sketch of this check appears after this list. Depending on the type of message (e.g. an activity function return value message), this could result in an orchestration instance getting permanently stuck, which is potentially a data loss scenario.
  3. Save: Save the poison message into storage for later processing. This option could have two sub-options: a) suspend the orchestration instance (yet another new feature) until the message is dealt with or b) allow processing other messages for the instance. A command could be exposed that attempts to dequeue all poison messages when the user is ready to try again. Orchestrations don't generally depend on ordering from the queue (they manage their own ordering), so it should be okay to process other queue messages out of order. However, orchestrations could get stuck waiting for messages in the poison queue, and we would need to consider signaling this in the orchestration status.
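
For illustration only (this sketch is not part of the original proposal), the dequeue-count check behind options 2 and 3 could look roughly like the following against the Azure.Storage.Queues SDK; the method name, batch size, and 100-message threshold are assumptions:

using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

static async Task PumpControlQueueAsync(QueueClient controlQueue)
{
    const int MaxDequeueCount = 100; // the "configurable" threshold from option 2

    QueueMessage[] messages = (await controlQueue.ReceiveMessagesAsync(maxMessages: 32)).Value;
    foreach (QueueMessage message in messages)
    {
        if (message.DequeueCount >= MaxDequeueCount)
        {
            // Option 2 (Discard): delete the suspected poison message permanently.
            // Option 3 (Save) would first copy it to durable storage so it can be re-enqueued later.
            await controlQueue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
            continue;
        }

        // ... normal control-queue message processing ...
    }
}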
cgillum commented 6 years ago

Another interesting scenario: if an activity function runs in the consumption plan and takes longer than 10 minutes to execute, this could result in an infinite crash loop. This is because the functions host will recycle to kill the long-running activity function. When the host restarts, it will try to re-execute that activity function and fail again (and this will continue indefinitely).
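
For reference (not part of the original comment), the 10-minute ceiling mentioned here corresponds to the functionTimeout setting in host.json, which on the Consumption plan can be raised to at most 10 minutes:

{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}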

gorillapower commented 5 years ago

Wondering if the team has any sort of timeline on this issue? Is it on the radar for the next release at all? We have a situation where our functions are prone to an infinite crash loop, and it's producing a lot of errors in our customers' error logs.

cgillum commented 5 years ago

No specific timelines to share yet, other than that it won't be available for the v1.7 release (which is next). More likely it will make its way into a v2.0 release.

Because any message is just one of a sequence of internal and inter-related messages, it's not really sufficient for us to create a simple "poison queue" like we would for a regular queue-trigger function. Instead, I suspect we'll need to introduce a new "suspended" state for orchestrations and come up with an experience for detecting, pausing, and resuming suspended orchestrations. This will need to be a new concept which is supported by the Durable Task Framework.

gorillapower commented 5 years ago

Just out of interest, we implemented a workaround where the runaway function throws an exception based on a Table Storage lookup.

  1. After 10 minutes, the calling function inserts a record into Table Storage containing the invocationId of the runaway function.
  2. After restarting, the runaway function looks up its current invocationId in Table Storage and, if a record is found, throws an exception (or completes) instead of re-running.

This at least stops the runaway function from running for days on end. A rough sketch of this pattern is below.
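
A minimal sketch of this workaround (not the commenter's actual code), assuming the Azure.Data.Tables SDK; the table name, partition key, and helper names are assumptions, and how the invocationId is shared between caller and activity is glossed over:

using System;
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class RunawayGuard
{
    // Assumes the "RunawayInvocations" table already exists (or call CreateIfNotExistsAsync first).
    static readonly TableClient Table = new TableClient(
        Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "RunawayInvocations");

    // Step 1: the caller flags the invocation once it has exceeded its budget (~10 minutes).
    public static Task MarkAsRunawayAsync(Guid invocationId) =>
        Table.AddEntityAsync(new TableEntity("runaway", invocationId.ToString()));

    // Step 2: the restarted function checks whether its own invocationId was flagged
    // and throws (or returns early) instead of entering another long-running crash loop.
    public static async Task ThrowIfFlaggedAsync(Guid invocationId)
    {
        var existing = await Table.GetEntityIfExistsAsync<TableEntity>("runaway", invocationId.ToString());
        if (existing.HasValue)
        {
            throw new InvalidOperationException($"Invocation {invocationId} was flagged as runaway; aborting.");
        }
    }
}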

ConnorMcMahon commented 3 years ago

This issue comes back to bite us periodically. We should prioritize this.

sebastianburckhardt commented 3 years ago

Came across this today and thinking about it some more. This is also related to some of the issues addressed in #1917.

Personally I think "poison" message is sort of a low-level term going back to queue-processing systems. Given that DF is built on higher-level abstractions (orchestrations, entities, activities) I think we should aim for solutions that make sense at that level. From what I understand there are two types of failures in particular that we are trying to "de-toxify":

The next question is how we surface the errors. My suggestion would be to stay as similar as possible to already-existing error handling mechanisms. In some cases, this is pretty straightforward, in others, not so much:

cgillum commented 3 years ago

I think this approach mostly works and would be straightforward to implement, especially for orchestrations and activities. I'm less confident about entities, however. There is a class of "poison" issues that will only be observed in the DTFx layer, for example message deserialization issues. Since DTFx doesn't distinguish between orchestrations and entities today, we won't be able to easily implement entity-specific mitigations. We could still have entity-specific handling in the Azure Functions layer, but then we run the risk of having inconsistent poison message handling semantics for entities, which may frustrate customers.

sebastianburckhardt commented 3 years ago

Yes, we would need to handle the entity-specific poison message at the DF layer to give the right experience for entities. The DTFx layer does not have enough information. For example, it does not understand the meaning of orchestrator events (call, signal, lock, unlock, ...), and it cannot tell whether the poisonous behavior is in user code or in the framework code. Also, unlike orchestrations, entities must never terminate or they become permanently unusable.

I don't think this is necessarily a big issue. For now we can simply exclude entities (recognized by instance id) from the back-end poison message handling and do it in the DF layer instead. There is already plenty of error handling logic in the DF layer, so it makes sense to work with that. Basically I would need to add something that looks at the dequeue count and takes action when it is high.
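
As a rough illustration of that idea (not actual DF code): entity instances can be recognized by their "@" instance ID prefix, so the back-end poison handling could skip them and the DF layer could apply its own dequeue-count check. The threshold and helper names below are assumptions:

using System;

static class EntityPoisonHandling
{
    const int EntityDequeueThreshold = 100; // hypothetical, would be configurable

    // Durable entity instance IDs start with '@'.
    static bool IsEntityInstance(string instanceId) =>
        instanceId.StartsWith("@", StringComparison.Ordinal);

    static bool TryHandleAsEntityPoison(string instanceId, long dequeueCount)
    {
        if (!IsEntityInstance(instanceId))
        {
            return false; // leave it to the back-end (DurableTask.AzureStorage) poison handling
        }

        if (dequeueCount >= EntityDequeueThreshold)
        {
            // DF-layer mitigation would go here, e.g. surface the failing operation as an
            // application error rather than terminating the entity instance.
            return true;
        }

        return false;
    }
}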

I understand that ultimately it would be nice to eliminate the DF-layer logic for entities but that is not something we can do in a piecemeal approach. We will have to move all of that logic into DTFx at once at some point. Until then we are forced to work within the DF layer when fine-tuning the entity behavior.

cgillum commented 3 years ago

Regarding this:

When an orchestrator function is poisonous, I would say we should treat this similarly to non-determinism exceptions, which are considered irrecoverable application errors, and directly move the orchestrator to an error state (?)

I don't think we'll be able to easily surface this as an application error since the DTFx dispatcher executes outside the context of orchestrator code. Instead I think we'll need to directly transition the orchestrator into a failed state. One option could be to drop all the messages from the poisoned batch and inject an ExecutionTerminatedEvent into the orchestration history with an appropriate message. That would allow the instance to go through the normal termination cycle. I can't remember how or whether a terminated sub-orchestration instance gets bubbled up to a parent orchestration, but hopefully the behavior is reasonable (and if not, maybe we could fix that too).
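
A sketch of what that injection might look like against DurableTask.Core types (the surrounding dispatcher logic and the reason text are assumptions; this is not an agreed-upon implementation):

using DurableTask.Core;
using DurableTask.Core.History;

static class PoisonBatchMitigation
{
    // Hypothetical: called after the poisoned batch has been dropped, so the instance
    // goes through its normal history-driven termination path.
    static void TerminatePoisonedInstance(OrchestrationRuntimeState runtimeState)
    {
        var terminatedEvent = new ExecutionTerminatedEvent(
            -1, // eventId for a generated (non-replayed) event
            "Instance terminated: repeated failures while processing an internal message (suspected poison message).");

        runtimeState.AddEvent(terminatedEvent);
    }
}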

ConnorMcMahon commented 3 years ago

Would it be possible to introduce a new exception type in DurableTask.Core, one that DurableTask.Core knows how to interpret, that we can throw from any layer? Then we could throw that exception from anywhere and keep the handling in one specific location.

The tricky part would be making sure the exception actually surfaces all the way to DurableTask.Core, but we seem to have had some luck with this style of approach for SessionAbortedException.
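
A minimal sketch of such an exception type (the name PoisonMessageException and its members are hypothetical, modeled loosely on how SessionAbortedException is consumed):

using System;

namespace DurableTask.Core.Exceptions
{
    // Hypothetical: thrown by any layer when a message is judged to be poisonous;
    // the dispatcher would catch it and apply the agreed-upon mitigation in one place.
    public class PoisonMessageException : Exception
    {
        public PoisonMessageException(string message, string instanceId, Exception innerException = null)
            : base(message, innerException)
        {
            this.InstanceId = instanceId;
        }

        public string InstanceId { get; }
    }
}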

sebastianburckhardt commented 3 years ago

One option could be to drop all the messages from the poisoned batch and inject a ExecutionTerminatedEvent into the orchestration history with an appropriate message.

I like this except for the fact that one does not see what messages were dropped. Perhaps we can insert both the poisoned messages and the ExecutionTerminatedEvent in the history? This allows examination for forensic purposes. We just need to make sure that we are not trying to replay this history again later.

cgillum commented 3 years ago

I like this except for the fact that one does not see what messages were dropped.

We do already write logs with some details about any messages we dropped. Details include things like the instance ID, execution ID, message ID, event name (like ExecutionStarted or TaskCompleted), etc. Would this be sufficient for forensics?

Personally I think "poison" message is sort of a low-level term going back to queue-processing systems. Given that DF is built on higher-level abstractions (orchestrations, entities, activities) I think we should aim for solutions that make sense at that level.

I agree - I always feel weird when talking to customers/users about our "poison message" policy since messaging is an internal implementation detail that they shouldn't need to think about. :) I'm not sure of a better way to describe it, though, since we are actually using queues. I'd be curious to know if there's any prior art around this in other workflow systems. The closest example I can think of would be systems that would suspend an orchestration/workflow if an internal error was encountered while trying to process some event, but that's more of a description of the policy rather than the core issue.

Just for tracking purposes, I'll also add a link to a related discussion in the DTFx repo: https://github.com/Azure/durabletask/issues/553

yukscck commented 2 years ago

Wondering if this is expected to be worked on soon?

cgillum commented 2 years ago

@yukscck we don't yet have a plan for when this will be worked on.

Adding @AnatoliB for visibility.

cgillum commented 2 years ago

Adding @lilyjma for visibility on this item as well. We should try to find a way to prioritize this item as poison messages can potentially cause a lot of damage to an app in terms of performance and reliability.

davidmrdavid commented 2 years ago

I've added this to our internal taskboard so we don't miss it.

rafaeldnferreira commented 1 year ago

Hello, I would like to share my thoughts on handling poison messages. @cgillum has just linked the case we've been troubleshooting; all the data I've collected, and the actions I've taken (deleting the poison messages), have resulted in restoring the service levels back to what we consider our normal/expected SLAs.

Please note that my comments are based only on the core DTF functionality, as I'm not familiar with DFx/Durable Entities.

Reading the thread, and with my still-limited knowledge of the inner workings of the framework, from the perspective of an operator monitoring the live system in production I would expect:

From the perspective of the developer, I would expect:

public interface IRetryPolicyHandler
{
    // WrappedMessage would carry metadata (dequeue count, event type, instance ID, ...)
    // to allow logic/decision making by the handler.
    bool ShouldRetry(WrappedMessage message);
}
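
For illustration, a trivial implementation of that proposal might look like the following (WrappedMessage and its DequeueCount property are hypothetical, as in the interface above):

// Hypothetical sample: give up after a configurable number of dequeues instead of retrying forever.
public class MaxDequeueCountRetryPolicyHandler : IRetryPolicyHandler
{
    private readonly int maxDequeueCount;

    public MaxDequeueCountRetryPolicyHandler(int maxDequeueCount = 100)
    {
        this.maxDequeueCount = maxDequeueCount;
    }

    public bool ShouldRetry(WrappedMessage message) =>
        message.DequeueCount < this.maxDequeueCount;
}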

Regards