moldovangeorge opened 1 year ago
Hi @cgillum, @sebastianburckhardt, could you help with a review here, please?
@cgillum - mind giving this a quick check?
Just to summarize, I understand that you're proposing two things:
1. Removing the special-case handling of `TypeMissingException` and instead relying on the generic exception handling mechanism for all exception types.
2. Skipping the `SafeRelease` call on the exception flow.

I'm fine with (1) but worried about the unintended consequences of (2).
I think skipping the SafeRelease on the exception flow does not present any risks, since providers that have not yet implemented the 2 methods from above will rely on the automatic expiration of the lock in this case, without an explicit release.
When you say "providers that have not yet implemented the 2 methods from above", which two methods are you referring to?
By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.
Let me know if I'm misunderstanding something.
By the 2 methods from above I mean `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem`. These 2 methods should determine the behavior for failed tasks for a given provider. `AbortWorkItem` is actually the `AbandonTaskOrchestrationWorkItemAsync` in the `IOrchestrationService`.
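For reference, here is a rough excerpt of the relevant members as I understand them (paraphrased, not the exact interface definition; check `IOrchestrationService` in DurableTask.Core for the real signatures):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core; // TaskOrchestrationWorkItem

// Paraphrased excerpt of the members discussed in this thread (not the full
// interface; signatures may differ slightly from DurableTask.Core).
public interface IOrchestrationServiceExcerpt
{
    // Lets a provider request a global back-off (in seconds) after a
    // processing exception; returning 0 means "no back-off".
    int GetDelayInSecondsAfterOnProcessException(Exception exception);

    // "AbortWorkItem" in the dispatcher maps to this: hand the work item back
    // to the provider without completing it, letting the provider decide
    // when and how it becomes visible again.
    Task AbandonTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem);

    // Release the lock on the work item so it can be picked up again.
    Task ReleaseTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem);
}
```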
Removing the release from the error handling flow will only result in the message remaining locked if a given provider does not implement the `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem` methods, but does implement `ReleaseTaskOrchestrationWorkItemAsync`. I will analyze if such a provider exists today.
To verify if there will indeed be a penalty for this change in the existing providers, I performed the following analysis:

- DurableTask.ServiceBus: implements both `AbandonTaskOrchestrationWorkItemAsync` and `ReleaseTaskOrchestrationWorkItemAsync`, so no performance penalty will be caused by long lock acquisition. The current implementation of `GetDelayInSecondsAfterOnProcessException` is to insert a global back-off for transient exceptions and 0 otherwise. This means that tasks that failed with `TypeMissingException` will be retried immediately and indefinitely until a worker is eventually able to pick them up. It is not ideal, but I think it's a better middle ground than the current situation, where such an exception will cause all workers to eventually stop processing until that task is forcefully removed from the queue.
- DurableTask.AzureStorage: `AbandonTaskOrchestrationWorkItemAsync` already backs off per work item (as you described here) to avoid tight failure loops, so there is no penalty for this provider. The current implementation of `GetDelayInSecondsAfterOnProcessException` is to have a global back-off of 10 seconds for all exceptions. This, coupled with the implementation of the Abort operation in this provider, makes sense to me and has a very low chance of either stopping processing or over-retrying tasks that are going to fail anyway.
- DurableTask.ServiceFabric: the current implementation of `GetDelayInSecondsAfterOnProcessException` is to have a 1 or 2 second global back-off for `FabricNotReadableException` and `TimeoutException` and 0 otherwise. Same over-retrying vulnerability as the ServiceBus provider.
- DurableTask.Netherite: based on its current implementation of `GetDelayInSecondsAfterOnProcessException`, same vulnerability as the ServiceBus provider.

After analyzing the above, my conclusion is:
Before this change, all providers had an issue: if a few tasks that generated `TypeMissingException` were present in the queue, this would cause a serious performance degradation, eventually leading workers to stop completely.
After this change: providers will maintain the same flow for handling different exceptions as before, no performance penalty will be introduced, and they will be able to handle the `TypeMissingException` explicitly through their implementations of `GetDelayInSecondsAfterOnProcessException` and `AbandonTaskOrchestrationWorkItemAsync`. Specifically, most of the providers, with their current implementations, will replace this performance degradation with continuous retrying of the tasks that generate the `TypeMissingException`. My personal opinion is that this is a good middle ground that is easily fixable in all the providers by implementing the same type of logic (exponential back-off for a given task) that is now present in the DurableTask.AzureStorage provider's `AbandonTaskOrchestrationWorkItemAsync` method.
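To make that concrete, here is a rough, hypothetical sketch of what per-task exponential back-off could look like inside a provider's `AbandonTaskOrchestrationWorkItemAsync` (the class name, the tracking dictionary, and the `RequeueWorkItemAsync` helper are illustrative, not code from any existing provider):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using DurableTask.Core;

// Illustrative only: back off per work item when the dispatcher abandons it,
// instead of making it visible again immediately.
public class BackoffOrchestrationServiceSketch
{
    // Tracks how many times each orchestration instance has been abandoned.
    readonly ConcurrentDictionary<string, int> abandonCounts =
        new ConcurrentDictionary<string, int>();

    public Task AbandonTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem)
    {
        int attempts = this.abandonCounts.AddOrUpdate(
            workItem.InstanceId, 1, (_, count) => count + 1);

        // Exponential back-off capped at 5 minutes: 2s, 4s, 8s, ..., 300s.
        TimeSpan visibilityDelay =
            TimeSpan.FromSeconds(Math.Min(300, Math.Pow(2, attempts)));

        // Provider-specific: put the work item back in the underlying store,
        // but keep it invisible for `visibilityDelay` so a poisonous task is
        // not retried in a tight loop.
        return this.RequeueWorkItemAsync(workItem, visibilityDelay);
    }

    // Hypothetical helper standing in for the provider's re-enqueue/unlock logic.
    Task RequeueWorkItemAsync(TaskOrchestrationWorkItem workItem, TimeSpan visibleAfter)
    {
        return Task.CompletedTask;
    }
}
```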
Please let me know if I missed something or if this is not aligned with the long-term vision of the OrchestrationService.
@moldovangeorge please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:
- (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
- (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
@microsoft-github-policy-service agree company="UiPath"
@cgillum Gentle reminder to take a look at the above.
Hi @moldovangeorge. I read your last response but I'm still not sure if it covers my previous question:
By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.
Can you re-explain to me why we're moving the "release" out of the `finally` block and instead only executing it in the `try` block?
Hi @cgillum, sure: the presence of SafeReleaseWorkItem on the exception path was forcing a fast-retry behaviour on the exception path for the DTF providers that implemented the SafeRelease logic. While the global back-off was in place, this was not an issue, because the retries were delayed by the global back-off anyway. With this change, now that the global back-off will disappear, keeping SafeReleaseWorkItem on the error handling flow would result in an immediate retry of the task (for the providers that implement `ReleaseTaskOrchestrationWorkItemAsync`), taking us to the other extreme: from global back-off to immediate and never-ending retry.

So the reason for excluding SafeReleaseWorkItem from the error handling flow is to avoid excessive retries and to give the liberty to the provider implementation to choose how to handle broken tasks via the `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem` methods.
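Here is a simplified sketch of the control-flow change (just the shape of the dispatcher loop; completion, lock renewal, and the old TypeMissingException special-casing are omitted, and the `process` delegate stands in for the actual work-item processing):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;

static class DispatcherSketch
{
    // Before: the release ran on every path (finally), so a failed work item
    // was unlocked immediately and could be re-fetched right away, which
    // becomes a fast retry loop once the global back-off is gone.
    public static async Task ProcessBeforeAsync(
        IOrchestrationService service, TaskOrchestrationWorkItem workItem, Func<Task> process)
    {
        try
        {
            await process();
        }
        catch (Exception e)
        {
            int delay = service.GetDelayInSecondsAfterOnProcessException(e);
            if (delay > 0) await Task.Delay(TimeSpan.FromSeconds(delay));
            await service.AbandonTaskOrchestrationWorkItemAsync(workItem); // "AbortWorkItem"
        }
        finally
        {
            await service.ReleaseTaskOrchestrationWorkItemAsync(workItem); // SafeReleaseWorkItem
        }
    }

    // After: the release happens only on the success path; on failure the
    // provider's Abandon/GetDelay implementations (or plain lock expiration)
    // decide when the work item becomes visible again.
    public static async Task ProcessAfterAsync(
        IOrchestrationService service, TaskOrchestrationWorkItem workItem, Func<Task> process)
    {
        try
        {
            await process();
            await service.ReleaseTaskOrchestrationWorkItemAsync(workItem);
        }
        catch (Exception e)
        {
            int delay = service.GetDelayInSecondsAfterOnProcessException(e);
            if (delay > 0) await Task.Delay(TimeSpan.FromSeconds(delay));
            await service.AbandonTaskOrchestrationWorkItemAsync(workItem);
        }
    }
}
```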
The removal of SafeReleaseWorkItem would "guarantee that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout" only if there were a provider which "does not implement the GetDelayInSecondsAfterOnProcessException and AbortWorkItem methods, but does implement ReleaseTaskOrchestrationWorkItemAsync."

My analysis from the previous comment tried to find out if there is any DTF provider that would be affected by this change, and I could not find any provider whose behaviour for failed tasks would change after this change is merged.
Thanks @moldovangeorge - I agree that removing the global backoff is a good thing, and I understand the concern about creating a tight failure loop for failures generally. One more point I'd like to confirm before going forward with this:
...give the liberty to the provider implementation to choose how to handle broken tasks via the `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem` methods.
If I understand correctly:

- `GetDelayInSecondsAfterOnProcessException` allows backend implementations to introduce the global backoff behavior if they want it. The default is no global backoff.
- `AbortWorkItem` lets implementations decide if they want to release the lock on the work item or let it remain locked until the lock timeout expires. The default (based on current implementations) is that the work item will remain locked until the timeout expires.

Is this accurate? If so, I think this is fine. However, I am a bit concerned about this:
most of the providers, with their current implementation will replace this performance degradation with continuous retrying of the tasks that generate the TypeMissingException.
As you mentioned, DurableTask.AzureStorage (the most popular backend) is protected because the abandon logic already does exponential backoff. DurableTask.ServiceBus (the original/oldest provider) might be okay because I think there should be some kind of poison message handling in the Service Bus layer to protect it from runaway errors. I'm a bit worried about the Service Fabric implementation, however. I wonder if we should add `TypeMissingException` to the list of exceptions handled by `GetDelayInSecondsAfterOnProcessException` to avoid an unexpected change in behavior. It seems that the Netherite provider could also experience an unpleasant behavior change if there's a `TypeMissingException`.
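Roughly along these lines (purely illustrative; not actual code from the Service Fabric or Netherite providers, and I'm assuming `TypeMissingException` lives in DurableTask.Core.Exceptions):

```csharp
using System;
using DurableTask.Core.Exceptions; // assumed home of TypeMissingException

public partial class SomeProviderOrchestrationService
{
    // Illustrative: keep a back-off for the "missing type" case so those tasks
    // are not retried in a tight loop once the core-level special handling is gone.
    public int GetDelayInSecondsAfterOnProcessException(Exception exception)
    {
        if (exception is TypeMissingException)
        {
            return 10; // back off instead of retrying the same broken task immediately
        }

        // ...existing transient-exception handling stays as-is...
        return 0;
    }
}
```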
Adding @shankarsama and @sebastianburckhardt for FYI on this behavior change since this will affect the provider implementations you maintain. I'm inclined to accept this change, so please speak up if you have concerns.
GetDelayInSecondsAfterOnProcessException allows backend implementations to introduce the global backoff behavior if they want it. The default is no global backoff. AbortWorkItem lets implementations decide if they want to release the lock on the work item or let it remain locked until the lock timeout expires. The default (based on current implementations) is that the work item will remain locked until the timeout expires.
Yes, @cgillum, your understanding from above is accurate.

For the 3 providers that you mentioned (ServiceFabric, ServiceBus and Netherite), yes, the behaviour for `TypeMissingException` will change from worker degradation until a complete halt to continuous retries of the tasks that generate a `TypeMissingException`. My personal opinion is that this is a good step forward, which can be further improved by implementing in these providers the same type of behaviour that the Azure Storage provider has for `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem`. From the perspective of a user of this technology, I would rather have a task stuck in the queue and retrying infinitely than have my workers completely stopped because of some poisonous tasks. It's not perfect, but it's easily fixable at the provider level.
Hi @cgillum, @shankarsama, @sebastianburckhardt, are there any updates on this matter?
@moldovangeorge: Just to build up context - is there a particular storage provider you're most interested in here? Say, if we improved this behavior just for Azure Storage but not for the other backends, would that work for your scenario? I'm trying to figure out if a scoped change like that would be easier to merge. As it stands, affecting all storage providers will require multiple stakeholders to chime in, and that coordination is tricky.
@davidmrdavid, this is not fixable at the provider level. Without this change, all providers suffer from the same vulnerability: serious performance degradation caused by `TypeMissingException`. Since the root cause lives in the DTF Core, there is no provider-level alternative for fixing this.
@moldovangeorge. I took a closer look at the code, and I see what you mean now. I'm just trying to reduce the coordination effort needed to merge this as, for example, the Netherite SME is not available at this time to chime in (on vacation) and they could probably be a blocker here.
A way to make this immediately safer to merge would be to turn this into opt-in behavior. Perhaps this can be a boolean setting that users opt into, but it is by default turned off. That would allow us to merge this behavior more easily, without having to worry much about implicitly changing the behavior of storage providers.
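For example, something like a hypothetical flag (the name and where it would live are made up just to show the idea):

```csharp
// Hypothetical opt-in flag; name and placement are illustrative only.
public class DispatcherBehaviorOptionsSketch
{
    // Default false: keep today's behavior (release the work item even on the
    // error path). When true, skip the release on failures and let the
    // provider's Abandon/GetDelay implementations (or lock expiration) control
    // when the work item becomes visible again.
    public bool SkipReleaseOnProcessingError { get; set; } = false;
}
```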
Hi @davidmrdavid, as long as the necessary people will eventually sign off on this, I'm fine with waiting. This has already been open for 9 months, and at this point I would rather wait a little longer, since transforming this into an opt-in behaviour might add clutter to a critical part of the Core project.
Fair enough. I'll tag @sebastianburckhardt here for his 2 cents once he's back from OOF
Hi all, @davidmrdavid, @sebastianburckhardt, just bumping this for visibility; maybe we can conclude and close this. Thanks!
This change addresses the issue described here: https://github.com/Azure/durabletask/issues/886