moldovangeorge opened 1 year ago
Hi @cgillum, @sebastianburckhardt, could you help with a review here, please?
@cgillum - mind giving this a quick check?
Just to summarize, I understand that you're proposing two things:
1. Removing the special-case handling of `TypeMissingException` and instead relying on the generic exception handling mechanism for all exception types.
2. Skipping the `SafeRelease` call on the exception flow.

I'm fine with (1) but worried about the unintended consequences of (2).
I think skipping the SafeRelease on the exception flow does not present any risks, since providers that have not yet implemented the 2 methods from above will rely on the automatic expiration of the lock in this case, without an explicit release.
When you say "providers that have not yet implemented the 2 methods from above", which two methods are you referring to?
By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.
Let me know if I'm misunderstanding something.
By the 2 methods from above I mean `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem`. These 2 methods should determine the behavior for failed tasks for a given provider. `AbortWorkItem` is actually the `AbandonTaskOrchestrationWorkItemAsync` in the `IOrchestrationService`.
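For reference, here is a rough excerpt of the relevant members as I understand them (paraphrased, not the exact interface definition; check `IOrchestrationService` in DurableTask.Core for the real signatures):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core; // TaskOrchestrationWorkItem

// Paraphrased excerpt of the members discussed in this thread (not the full
// interface; signatures may differ slightly from DurableTask.Core).
public interface IOrchestrationServiceExcerpt
{
    // Lets a provider request a global back-off (in seconds) after a
    // processing exception; returning 0 means "no back-off".
    int GetDelayInSecondsAfterOnProcessException(Exception exception);

    // "AbortWorkItem" in the dispatcher maps to this: hand the work item back
    // to the provider without completing it, letting the provider decide
    // when and how it becomes visible again.
    Task AbandonTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem);

    // Release the lock on the work item so it can be picked up again.
    Task ReleaseTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem);
}
```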
Removing the release from the error handling flow will only result in the message remaining locked if a given provider does not implement the `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem` methods, but does implement `ReleaseTaskOrchestrationWorkItemAsync`. I will analyze if such a provider exists today.
To verify if there will indeed be a penalty for this change in the existing providers, I performed the following analysis:

- DurableTask.ServiceBus: implements both `AbandonTaskOrchestrationWorkItemAsync` and `ReleaseTaskOrchestrationWorkItemAsync`, so no performance penalty will be caused by long lock acquisition. The current implementation of `GetDelayInSecondsAfterOnProcessException` is to insert a global back-off for transient exceptions and 0 otherwise. This means that tasks that failed with `TypeMissingException` will be retried immediately and indefinitely until a worker is eventually able to pick them up. It is not ideal, but I think it's a better middle ground than the current situation, where such an exception will cause all workers to eventually stop processing until that task is forcefully removed from the queue.
- DurableTask.AzureStorage: `AbandonTaskOrchestrationWorkItemAsync` already backs off per work item (as you described here) to avoid tight failure loops, so there is no penalty for this provider. The current implementation of `GetDelayInSecondsAfterOnProcessException` is to have a global back-off of 10 seconds for all exceptions. This, coupled with the implementation of the Abort operation in this provider, makes sense to me and has a very low chance of either stopping processing or over-retrying tasks that are going to fail anyway.
- DurableTask.ServiceFabric: the current implementation of `GetDelayInSecondsAfterOnProcessException` is to have a 1 or 2 second global back-off for `FabricNotReadableException` and `TimeoutException` and 0 otherwise. Same over-retrying vulnerability as the ServiceBus provider.
- DurableTask.Netherite: based on its current implementation of `GetDelayInSecondsAfterOnProcessException`, same vulnerability as the ServiceBus provider.

After analyzing the above, my conclusion is:
Before this change, all providers had an issue: if a few tasks that generated `TypeMissingException` were present in the queue, this would cause a serious performance degradation, eventually leading workers to stop completely.
After this change: providers will maintain the same flow for handling different exceptions as before, no performance penalty will be introduced, and they will be able to handle the `TypeMissingException` explicitly through their implementations of `GetDelayInSecondsAfterOnProcessException` and `AbandonTaskOrchestrationWorkItemAsync`. Specifically, most of the providers, with their current implementations, will replace this performance degradation with continuous retrying of the tasks that generate the `TypeMissingException`. My personal opinion is that this is a good middle ground that is easily fixable in all the providers by implementing the same type of logic (exponential back-off for a given task) that is now present in the DurableTask.AzureStorage provider's `AbandonTaskOrchestrationWorkItemAsync` method.
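To make that concrete, here is a rough, hypothetical sketch of what per-task exponential back-off could look like inside a provider's `AbandonTaskOrchestrationWorkItemAsync` (the class name, the tracking dictionary, and the `RequeueWorkItemAsync` helper are illustrative, not code from any existing provider):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using DurableTask.Core;

// Illustrative only: back off per work item when the dispatcher abandons it,
// instead of making it visible again immediately.
public class BackoffOrchestrationServiceSketch
{
    // Tracks how many times each orchestration instance has been abandoned.
    readonly ConcurrentDictionary<string, int> abandonCounts =
        new ConcurrentDictionary<string, int>();

    public Task AbandonTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem)
    {
        int attempts = this.abandonCounts.AddOrUpdate(
            workItem.InstanceId, 1, (_, count) => count + 1);

        // Exponential back-off capped at 5 minutes: 2s, 4s, 8s, ..., 300s.
        TimeSpan visibilityDelay =
            TimeSpan.FromSeconds(Math.Min(300, Math.Pow(2, attempts)));

        // Provider-specific: put the work item back in the underlying store,
        // but keep it invisible for `visibilityDelay` so a poisonous task is
        // not retried in a tight loop.
        return this.RequeueWorkItemAsync(workItem, visibilityDelay);
    }

    // Hypothetical helper standing in for the provider's re-enqueue/unlock logic.
    Task RequeueWorkItemAsync(TaskOrchestrationWorkItem workItem, TimeSpan visibleAfter)
    {
        return Task.CompletedTask;
    }
}
```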
Please let me know if I missed something or if this is not aligned with the long-term vision of the OrchestrationService.
@moldovangeorge please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:
- (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
- (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
@microsoft-github-policy-service agree company="UiPath"
@cgillum Gentle reminder to take a look at the above.
Hi @moldovangeorge. I read your last response but I'm still not sure if it covers my previous question:
By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.
Can you re-explain to me why we're moving the "release" out of the `finally` block and instead only executing it in the `try` block?
Hi @cgillum, sure: the presence of SafeReleaseWorkItem on the exception path was forcing a fast-retry behaviour on the exception path for the DTF providers that implemented the SafeRelease logic. While the global back-off was in place, this was not an issue, because the retries were delayed by the global back-off anyway. With this change, now that the global back-off will disappear, keeping SafeReleaseWorkItem on the error handling flow would result in an immediate retry of the task (for the providers that implement `ReleaseTaskOrchestrationWorkItemAsync`), taking us to the other extreme: from global back-off to immediate and never-ending retry.

So the reason for excluding SafeReleaseWorkItem from the error handling flow is to avoid excessive retries and to give the liberty to the provider implementation to choose how to handle broken tasks via the `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem` methods.
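Here is a simplified sketch of the control-flow change (just the shape of the dispatcher loop; completion, lock renewal, and the old TypeMissingException special-casing are omitted, and the `process` delegate stands in for the actual work-item processing):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;

static class DispatcherSketch
{
    // Before: the release ran on every path (finally), so a failed work item
    // was unlocked immediately and could be re-fetched right away, which
    // becomes a fast retry loop once the global back-off is gone.
    public static async Task ProcessBeforeAsync(
        IOrchestrationService service, TaskOrchestrationWorkItem workItem, Func<Task> process)
    {
        try
        {
            await process();
        }
        catch (Exception e)
        {
            int delay = service.GetDelayInSecondsAfterOnProcessException(e);
            if (delay > 0) await Task.Delay(TimeSpan.FromSeconds(delay));
            await service.AbandonTaskOrchestrationWorkItemAsync(workItem); // "AbortWorkItem"
        }
        finally
        {
            await service.ReleaseTaskOrchestrationWorkItemAsync(workItem); // SafeReleaseWorkItem
        }
    }

    // After: the release happens only on the success path; on failure the
    // provider's Abandon/GetDelay implementations (or plain lock expiration)
    // decide when the work item becomes visible again.
    public static async Task ProcessAfterAsync(
        IOrchestrationService service, TaskOrchestrationWorkItem workItem, Func<Task> process)
    {
        try
        {
            await process();
            await service.ReleaseTaskOrchestrationWorkItemAsync(workItem);
        }
        catch (Exception e)
        {
            int delay = service.GetDelayInSecondsAfterOnProcessException(e);
            if (delay > 0) await Task.Delay(TimeSpan.FromSeconds(delay));
            await service.AbandonTaskOrchestrationWorkItemAsync(workItem);
        }
    }
}
```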
The removal of SafeReleaseWorkItem would "guarantee that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout" only if there were a provider which "does not implement the GetDelayInSecondsAfterOnProcessException and AbortWorkItem methods, but does implement ReleaseTaskOrchestrationWorkItemAsync."

My analysis from the previous comment tried to find out if there is any DTF provider that would be affected by this change, and I could not find any provider whose behaviour for failed tasks would change after this change is merged.
Thanks @moldovangeorge - I agree that removing the global backoff is a good thing, and I understand the concern about creating a tight failure loop for failures generally. One more point I'd like to confirm before going forward with this:
...give the liberty to the provider implementation to choose how to handle broken tasks via the `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem` methods.
If I understand correctly:

- `GetDelayInSecondsAfterOnProcessException` allows backend implementations to introduce the global backoff behavior if they want it. The default is no global backoff.
- `AbortWorkItem` lets implementations decide if they want to release the lock on the work item or let it remain locked until the lock timeout expires. The default (based on current implementations) is that the work item will remain locked until the timeout expires.

Is this accurate? If so, I think this is fine. However, I am a bit concerned about this:
most of the providers, with their current implementation will replace this performance degradation with continuous retrying of the tasks that generate the TypeMissingException.
As you mentioned, DurableTask.AzureStorage (the most popular backend) is protected because the abandon logic already does exponential backoff. DurableTask.ServiceBus (the original/oldest provider) might be okay because I think there should be some kind of poison message handling in the Service Bus layer to protect it from runaway errors. I'm a bit worried about the Service Fabric implementation, however. I wonder if we should add `TypeMissingException` to the list of exceptions handled by `GetDelayInSecondsAfterOnProcessException` to avoid an unexpected change in behavior. It seems that the Netherite provider could also experience an unpleasant behavior change if there's a `TypeMissingException`.
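Roughly along these lines (purely illustrative; not actual code from the Service Fabric or Netherite providers, and I'm assuming `TypeMissingException` lives in DurableTask.Core.Exceptions):

```csharp
using System;
using DurableTask.Core.Exceptions; // assumed home of TypeMissingException

public partial class SomeProviderOrchestrationService
{
    // Illustrative: keep a back-off for the "missing type" case so those tasks
    // are not retried in a tight loop once the core-level special handling is gone.
    public int GetDelayInSecondsAfterOnProcessException(Exception exception)
    {
        if (exception is TypeMissingException)
        {
            return 10; // back off instead of retrying the same broken task immediately
        }

        // ...existing transient-exception handling stays as-is...
        return 0;
    }
}
```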
Adding @shankarsama and @sebastianburckhardt for FYI on this behavior change since this will affect the provider implementations you maintain. I'm inclined to accept this change, so please speak up if you have concerns.
GetDelayInSecondsAfterOnProcessException allows backend implementations to introduce the global backoff behavior if they want it. The default is no global backoff. AbortWorkItem lets implementations decide if they want to release the lock on the work item or let it remain locked until the lock timeout expires. The default (based on current implementations) is that the work item will remain locked until the timeout expires.
Yes, @cgillum, your understanding from above is accurate.

For the 3 providers that you mentioned (ServiceFabric, ServiceBus and Netherite), yes, the behaviour for `TypeMissingException` will change from worker degradation until a complete halt to continuous retries of the tasks that generate a `TypeMissingException`. My personal opinion is that this is a good step forward, which can be further improved by implementing in these providers the same type of behaviour that the Azure Storage provider has for `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem`. From the perspective of a user of this technology, I would rather have a task stuck in the queue and retrying infinitely than have my workers completely stopped because of some poisonous tasks. It's not perfect, but it's easily fixable at the provider level.
Hi @cgillum, @shankarsama, @sebastianburckhardt, are there any updates on this matter?
@moldovangeorge: Just to build up context - is there a particular storage provider you're most interested in here? Say, if we improved this behavior just for Azure Storage but not for the other backends, would that work for your scenario? I'm trying to figure out if a scoped change like that would be easier to merge. As it stands, affecting all storage providers will require multiple stakeholders to chime in, and that coordination is tricky.
@davidmrdavid, this is not fixable at the provider level. Without this change, all providers suffer from the same vulnerability: serious performance degradation caused by `TypeMissingException`. Since the root cause lives in the DTF Core, there is no provider-level alternative for fixing this.
@moldovangeorge. I took a closer look at the code, and I see what you mean now. I'm just trying to reduce the coordination effort needed to merge this as, for example, the Netherite SME is not available at this time to chime in (on vacation) and they could probably be a blocker here.
A way to make this immediately safer to merge would be to turn this into opt-in behavior. Perhaps this can be a boolean setting that users opt into, but it is by default turned off. That would allow us to merge this behavior more easily, without having to worry much about implicitly changing the behavior of storage providers.
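For example, something like a hypothetical flag (the name and where it would live are made up just to show the idea):

```csharp
// Hypothetical opt-in flag; name and placement are illustrative only.
public class DispatcherBehaviorOptionsSketch
{
    // Default false: keep today's behavior (release the work item even on the
    // error path). When true, skip the release on failures and let the
    // provider's Abandon/GetDelay implementations (or lock expiration) control
    // when the work item becomes visible again.
    public bool SkipReleaseOnProcessingError { get; set; } = false;
}
```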
Hi @davidmrdavid, as long as the necessary people will eventually sign off on this, I'm fine with waiting. This has already been open for 9 months, and at this point I would rather wait a little longer, since transforming this into an opt-in behaviour might add clutter to a critical part of the Core project.
Fair enough. I'll tag @sebastianburckhardt here for his 2 cents once he's back from OOF
Hi all, @davidmrdavid, @sebastianburckhardt, just bumping this for visibility; maybe we can conclude and close this. Thanks!
This change addresses the issue described here: https://github.com/Azure/durabletask/issues/886