cgillum opened this issue 6 years ago
Another interesting scenario: if an activity function runs in the consumption plan and takes longer than 10 minutes to execute, this could result in an infinite crash loop. This is because the functions host will recycle to kill the long-running activity function. When the host restarts, it will try to re-execute that activity function and fail again (and this will continue indefinitely).
Wondering if the team has any sort of timeline on this issue? Is it on the radar for the next release at all? We have a situation where our functions are prone to an infinite crash loop, and it's producing a lot of errors in our customers' error logs.
No specific timelines to share yet, other than that it won't be available for the v1.7 release (which is next). More likely it will make its way into a v2.0 release.
Because any message is just one of a sequence of internal and inter-related messages, it's not really sufficient for us to create a simple "poison queue" like we would for a regular queue-trigger function. Instead, I suspect we'll need to introduce a new "suspended" state for orchestrations and come up with an experience for detecting, pausing, and resuming suspended orchestrations. This will need to be a new concept which is supported by the Durable Task Framework.
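To make the "suspended" idea a bit more concrete, here is a minimal sketch of what such a surface might look like. The type and method names below are purely hypothetical and not an existing API; they just illustrate the detect/pause/resume experience described above.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical management surface for suspended orchestrations; names are illustrative only.
public enum InstanceState
{
    Running,
    Completed,
    Failed,
    Suspended   // new state: the instance was paused because its messages could not be processed
}

public interface ISuspendedInstanceClient
{
    // Detect: list instances that the framework automatically suspended.
    Task<IReadOnlyList<string>> GetSuspendedInstancesAsync(CancellationToken cancellation = default);

    // Resume: continue a suspended instance after the underlying problem has been fixed.
    Task ResumeInstanceAsync(string instanceId, string reason, CancellationToken cancellation = default);
}
```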
Just out of interest, we implemented a workaround where we threw an exception from the runaway function based on a table storage lookup.
This at least stops the runaway function from running for days on end.
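For anyone curious, a rough sketch of that kind of workaround using the Azure.Data.Tables SDK (the table name, schema, and helper are hypothetical): the long-running activity checks a kill-switch row and fails fast if it has been flagged, which breaks the crash loop.

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

// Sketch of a table-storage kill switch. The activity calls this at startup
// (or periodically) and throws if an operator has flagged the instance.
public static class KillSwitch
{
    public static async Task ThrowIfFlaggedAsync(string instanceId)
    {
        var table = new TableClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"),
            "RunawayKillSwitch");

        try
        {
            TableEntity entity = await table.GetEntityAsync<TableEntity>("killswitch", instanceId);
            if (entity.GetBoolean("Cancelled") == true)
            {
                // Deliberately failing the activity is what breaks the crash/retry loop.
                throw new InvalidOperationException($"Instance {instanceId} was flagged as runaway; aborting.");
            }
        }
        catch (RequestFailedException ex) when (ex.Status == 404)
        {
            // No kill-switch entry (or no table yet) means the activity may continue normally.
        }
    }
}
```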
This issue comes back to bite us periodically. We should prioritize this.
Came across this today and thinking about it some more. This is also related to some of the issues addressed in #1917.
Personally I think "poison" message is sort of a low-level term going back to queue-processing systems. Given that DF is built on higher-level abstractions (orchestrations, entities, activities) I think we should aim for solutions that make sense at that level. From what I understand there are two types of failures in particular that we are trying to "de-toxify":
The next question is how we surface the errors. My suggestion would be to stay as similar as possible to already-existing error handling mechanisms. In some cases, this is pretty straightforward, in others, not so much:
I think this approach mostly works and would be straightforward to implement, especially for orchestrations and activities. I'm less confident about entities, however. There is a class of "poison" issues that will only be observed in the DTFx layer, such as message deserialization issues. Since DTFx doesn't distinguish between orchestrations and entities today, we won't be able to easily implement entity-specific mitigations. We could still have entity-specific handling in the Azure Functions layer, but then we run the risk of having inconsistent poison message handling semantics for entities, which may frustrate customers.
Yes, we would need to handle the entity-specific poison message at the DF layer to give the right experience for entities. The DTFx layer does not have enough information. For example, it does not understand the meaning of orchestrator events (call, signal, lock, unlock, ...), and it cannot tell whether the poisonous behavior is in user code or in the framework code. Also, unlike orchestrations, entities must never terminate or they become permanently unusable.
I don't think this is necessarily a big issue. For now we can simply exclude entities (recognized by instance id) from the back-end poison message handling and do it in the DF layer instead. There is already plenty of error handling logic in the DF layer, so it makes sense to work with that. Basically I would need to add something that looks at the dequeue count and takes action when it is high.
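A minimal sketch of what that dequeue-count check could look like (the threshold, message shape, and class below are hypothetical; the real DF-layer types differ):

```csharp
using Microsoft.Extensions.Logging;

// Hypothetical sketch of a dequeue-count guard in the DF layer for entity messages.
public class EntityPoisonMessageFilter
{
    private const int MaxDequeueCount = 10; // illustrative threshold
    private readonly ILogger logger;

    public EntityPoisonMessageFilter(ILogger logger) => this.logger = logger;

    // Returns true if the message should be dispatched to the entity,
    // false if it should be treated as poison (logged and dropped or dead-lettered).
    public bool ShouldDispatch(string messageId, long dequeueCount)
    {
        if (dequeueCount > MaxDequeueCount)
        {
            // Dropping the message here keeps the entity usable instead of letting it
            // crash-loop on the same input forever.
            this.logger.LogWarning(
                "Dropping poison entity message {MessageId} after {DequeueCount} dequeues.",
                messageId,
                dequeueCount);
            return false;
        }

        return true;
    }
}
```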
I understand that ultimately it would be nice to eliminate the DF-layer logic for entities but that is not something we can do in a piecemeal approach. We will have to move all of that logic into DTFx at once at some point. Until then we are forced to work within the DF layer when fine-tuning the entity behavior.
Regarding this:
When an orchestrator function is poisonous, I would say we should treat this similarly to non-determinism exceptions, which are considered irrecoverable application errors, and directly move the orchestrator to an error state (?)
I don't think we'll be able to easily surface this as an application error since the DTFx dispatcher executes outside the context of orchestrator code. Instead, I think we'll need to directly transition the orchestrator into a failed state. One option could be to drop all the messages from the poisoned batch and inject an ExecutionTerminatedEvent into the orchestration history with an appropriate message. That would allow it to go through the normal termination cycle for that instance. I can't remember how or whether a terminated sub-orchestration instance gets bubbled up to a parent orchestration, but hopefully the behavior is reasonable (and if not, maybe we could fix that too).
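As a rough illustration of that option (assuming the two-argument ExecutionTerminatedEvent constructor in DurableTask.Core.History, and omitting all of the real dispatcher plumbing):

```csharp
using System.Collections.Generic;
using DurableTask.Core.History;

// Illustrative only: instead of appending the events decoded from the poisoned batch,
// append a termination event so the instance flows through the normal termination path.
static class PoisonBatchHandling
{
    public static void FailPoisonedInstance(IList<HistoryEvent> newHistoryEvents, int nextEventId, string reason)
    {
        // The poisoned messages themselves are dropped (and logged elsewhere).
        newHistoryEvents.Add(new ExecutionTerminatedEvent(nextEventId, reason));
    }
}
```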
Would it be possible to introduce a new exception type in DurableTask.Core that we can throw from any layer and that DurableTask.Core knows how to interpret? Then we could throw that exception from anywhere and have the handling in one specific location.
The tricky part would be making sure the exception handling happens in a way that it actually surfaces all the way up to DurableTask.Core, but we seem to have had some luck with this style of approach for SessionAbortedException.
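In that spirit, a hypothetical sketch of what such an exception type might look like (PoisonMessageException does not exist today; the name and shape are purely illustrative):

```csharp
using System;

// Hypothetical exception type for DurableTask.Core; any layer could throw it, and the
// dispatcher would handle it in one central place, similar to SessionAbortedException.
[Serializable]
public class PoisonMessageException : Exception
{
    public PoisonMessageException(string instanceId, string message)
        : base(message)
    {
        this.InstanceId = instanceId;
    }

    // The orchestration or entity instance whose message batch could not be processed.
    public string InstanceId { get; }
}
```

The central handling (dropping the batch, then failing or suspending the instance) would then live in a single place in the dispatcher, analogous to how SessionAbortedException is handled today.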
One option could be to drop all the messages from the poisoned batch and inject an ExecutionTerminatedEvent into the orchestration history with an appropriate message.
I like this except for the fact that one does not see what messages were dropped. Perhaps we can insert both the poisoned messages and the ExecutionTerminatedEvent in the history? This allows examination for forensic purposes. We just need to make sure that we are not trying to replay this history again later.
I like this except for the fact that one does not see what messages were dropped.
We do already write logs with some details about any messages we dropped. Details include things like the instance ID, execution ID, message ID, event name (like ExecutionStarted or TaskCompleted), etc. Would this be sufficient for forensics?
Personally I think "poison" message is sort of a low-level term going back to queue-processing systems. Given that DF is built on higher-level abstractions (orchestrations, entities, activities) I think we should aim for solutions that make sense at that level.
I agree - I always feel weird when talking to customers/users about our "poison message" policy since messaging is an internal implementation detail that they shouldn't need to think about. :) I'm not sure of a better way to describe it, though, since we are actually using queues. I'd be curious to know if there's any prior art around this in other workflow systems. The closest example I can think of would be systems that would suspend an orchestration/workflow if an internal error was encountered while trying to process some event, but that's more of a description of the policy rather than the core issue.
Just for tracking purposes, I'll also add a link to a related discussion in the DTFx repo: https://github.com/Azure/durabletask/issues/553
Wondering if this is expected to be worked on soon?
@yukscck we don't yet have a plan for when this will be worked on.
Adding @AnatoliB for visibility.
Adding @lilyjma for visibility on this item as well. We should try to find a way to prioritize this item as poison messages can potentially cause a lot of damage to an app in terms of performance and reliability.
I've added this to our internal taskboard so we don't miss it.
Hello, I would like to share my thoughts and perspective on handling poison messages, since @cgillum has just linked the case we've been troubleshooting here. All the data I've collected, and the actions I've taken (deleting the poison messages), resulted in restoring service levels back to what we consider our normal/expected SLAs.
Please note that my comments are based only on the core DTF functionality, as I'm not familiar with the DF extension / Durable Entities.
Reading the thread, and with my still limited knowledge of the inner workings of the framework, from the perspective of an operator monitoring the live system in production I would expect the affected instance to be marked as Aborted and to be able to restart from the latest valid checkpoint.
From the perspective of the developer, I would expect something along these lines:
public interface IRetryPolicyHandler
{
    // The wrapped message contains metadata to allow logic/decision making.
    bool ShouldRetry(WrappedMessage message);
}
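For illustration, one possible implementation of that hook based on the dequeue count (WrappedMessage and the handler below are hypothetical placeholders, not an existing API):

```csharp
// Minimal stand-in for the metadata mentioned above.
public record WrappedMessage(string MessageId, string InstanceId, int DequeueCount);

// One possible policy: stop retrying once a message has been delivered too many times.
public class DequeueCountRetryPolicyHandler : IRetryPolicyHandler
{
    private readonly int maxAttempts;

    public DequeueCountRetryPolicyHandler(int maxAttempts = 5) => this.maxAttempts = maxAttempts;

    public bool ShouldRetry(WrappedMessage message) => message.DequeueCount <= this.maxAttempts;
}
```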
Regards
Poison queue messages should be impossible if DurableTask.AzureStorage is implemented correctly, but there is no way to 100% guarantee this. The Azure Storage orchestration service should therefore have some kind of poison message handling to avoid generating unlimited amounts of log churn, storage I/O, and potentially billing charges.
Related issue: https://github.com/Azure/azure-functions-durable-extension/issues/334
Design Proposal
There are three different ways we could handle poison messages: