Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

AzureStorage: purge corrupted queue messages. #1088

Open jviau opened 1 month ago

jviau commented 1 month ago

We should evaluate updating these two code locations to delete corrupted messages (those that fail deserialization) from the queue. Deserialization failures are not expected to be transient, so no number of retries or amount of delay will fix these messages, particularly because only framework types (not user types) are deserialized here.

Location 1: https://github.com/Azure/durabletask/blob/b4ec695dc5c51319b99c20557f9d47a1dd518729/src/DurableTask.AzureStorage/Messaging/ControlQueue.cs#L108-L110

Location 2: https://github.com/Azure/durabletask/blob/b4ec695dc5c51319b99c20557f9d47a1dd518729/src/DurableTask.AzureStorage/Messaging/WorkItemQueue.cs#L47-L49

cgillum commented 1 month ago

The one caveat to this policy is that we've seen cases where changes to Newtonsoft.Json settings cause unintended deserialization failures. This can happen as part of a rollout of a new version of an app, whether due to changes made by the user (though hopefully we've rooted all those possibilities out) or changes made by the DTFx maintainers. Either way, giving users time (e.g. 24 hours) to roll back the change before permanently deleting their data might be prudent.
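A grace period like this could be expressed as a simple age check on the message. The following is only a minimal sketch: `PoisonMessagePolicy`, `DefaultGracePeriod`, and `IsPastGracePeriod` are hypothetical names, not existing DTFx APIs, and the 24-hour default just mirrors the example figure above. It assumes the caller can supply the time the message was inserted into the queue.

```csharp
using System;

// Hypothetical helper for illustration; not part of DurableTask.AzureStorage.
public static class PoisonMessagePolicy
{
    // Assumed default, taken from the "e.g. 24 hours" figure in the discussion.
    public static readonly TimeSpan DefaultGracePeriod = TimeSpan.FromHours(24);

    // Returns true once a message that fails deserialization has been in the
    // queue for at least the grace period, i.e. users have had time to roll back.
    public static bool IsPastGracePeriod(
        DateTimeOffset insertedOn,
        DateTimeOffset now,
        TimeSpan gracePeriod)
    {
        return now - insertedOn >= gracePeriod;
    }
}
```

A message that fails deserialization but is still inside the grace period would be left in the queue and retried later, so a rollback of the offending change can still recover it.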

jviau commented 1 month ago

Yeah, this will need some design. It could be an opt-in setting? Or a callback, where the user gets the exception and returns true/false to decide whether to purge?

Either way, the framework needs to take action here, as it is not something users can self-mitigate (they would be fighting with the workers to dequeue and delete the message!).
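To make the callback idea concrete, here is a rough sketch of what an opt-in hook might look like. Everything in it is hypothetical: `DequeueOutcome`, `PoisonMessageHandler`, and the `shouldPurge` delegate are illustrative names and do not exist in DTFx today. The `deserialize` and `deleteMessage` delegates stand in for whatever the queue classes actually do at the two locations above.

```csharp
using System;

// Hypothetical result type for illustration only.
public enum DequeueOutcome { Deserialized, Purged, LeftInQueue }

// Hypothetical helper; not an existing DurableTask.AzureStorage type.
public static class PoisonMessageHandler
{
    // deserialize:   throws if the raw message body is corrupted.
    // shouldPurge:   opt-in user callback; null keeps the message in the queue.
    // deleteMessage: removes the message from the queue.
    public static DequeueOutcome Handle(
        string rawBody,
        Action<string> deserialize,
        Func<Exception, bool> shouldPurge,
        Action deleteMessage)
    {
        try
        {
            deserialize(rawBody);
            return DequeueOutcome.Deserialized;
        }
        catch (Exception ex)
        {
            // A real implementation would likely catch only the serializer's
            // exception types rather than every Exception.
            if (shouldPurge != null && shouldPurge(ex))
            {
                deleteMessage();
                return DequeueOutcome.Purged;
            }

            return DequeueOutcome.LeftInQueue;
        }
    }
}
```

Because the framework itself performs the delete, the user only makes the decision and never has to race the workers for the message. Leaving `shouldPurge` as null keeps the behavior opt-in.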