Add simple poison message handling for Azure Storage

davidmrdavid commented 2 months ago

Poison messages are a rare but destructive scenario where DTFx retries a message indefinitely without ever making progress. This can destabilize the application and grow the control queue backlogs.

In those cases, we want to identify these "poison" messages and take them out of circulation by moving them to a "poison container", where the user can manually review and handle them, and by stopping further processing attempts.

This PR adds a simple poison message handling solution for orchestrator and activity messages. A message is considered poison when its DequeueCount is larger than 20, or some user-configured value. When an orchestrator or activity poison message is encountered, we place it in a new Azure Storage table called <taskhubName>-poison, which holds poison messages, and immediately delete it from the queue. This table is only created on demand, when a poison message is encountered.
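To make the detection and quarantine step concrete, here is a minimal in-memory sketch in Python. It is not the DTFx implementation: `QueueMessage`, `InMemoryTaskHub`, and `quarantine_if_poison` are hypothetical names, and plain dictionaries stand in for the Azure Storage queue and table. Only the threshold of 20, the `<taskhubName>-poison` naming, and the create-on-demand behavior come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Default from the PR description; the description says it is user-configurable.
DEFAULT_POISON_THRESHOLD = 20


@dataclass
class QueueMessage:
    """Hypothetical stand-in for a control-queue or work-item message."""
    message_id: str
    body: str
    dequeue_count: int


@dataclass
class InMemoryTaskHub:
    """Hypothetical stand-in for a task hub's queue and poison table."""
    task_hub_name: str
    queue: Dict[str, QueueMessage] = field(default_factory=dict)
    # None until the first poison message arrives (created on demand).
    poison_table: Optional[Dict[str, QueueMessage]] = None

    @property
    def poison_table_name(self) -> str:
        return f"{self.task_hub_name}-poison"


def is_poison(msg: QueueMessage, threshold: int = DEFAULT_POISON_THRESHOLD) -> bool:
    # "Larger than" the threshold: a count equal to the threshold is still retried.
    return msg.dequeue_count > threshold


def quarantine_if_poison(
    hub: InMemoryTaskHub,
    msg: QueueMessage,
    threshold: int = DEFAULT_POISON_THRESHOLD,
) -> bool:
    """Move a poison message to the poison table; return True if it was moved."""
    if not is_poison(msg, threshold):
        return False
    if hub.poison_table is None:
        hub.poison_table = {}  # create the table only when first needed
    hub.poison_table[msg.message_id] = msg  # store for manual review
    hub.queue.pop(msg.message_id, None)     # stop further processing attempts
    return True
```

Writing to the poison table before deleting from the queue means a crash between the two steps leaves a duplicate rather than a lost message; whether the PR orders the operations this way is an assumption of the sketch.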

From there, the consumer of that message is notified of the poison message. In the case of an orchestrator poison message, the orchestrator is terminated. In the case of an activity poison message, the activity is marked as failed, which in turn throws a catchable exception in the calling orchestrator. Poison messages for Entities are not handled yet - I'd appreciate guidance on how we think that should be handled, if at all.
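The per-consumer outcomes described above can be sketched as follows. This is an illustrative Python model, not DTFx code: `MessageKind`, `ActivityPoisonedError`, `outcome_for_poison_message`, and the toy `orchestrator` function are all hypothetical names chosen for the example.

```python
from enum import Enum


class MessageKind(Enum):
    ORCHESTRATOR = "orchestrator"
    ACTIVITY = "activity"
    ENTITY = "entity"


class ActivityPoisonedError(Exception):
    """Hypothetical exception surfaced to the orchestrator for a failed activity."""


def outcome_for_poison_message(kind: MessageKind) -> str:
    """Map a poison message's kind to the outcome described in the PR text."""
    if kind is MessageKind.ORCHESTRATOR:
        return "terminate_orchestrator"
    if kind is MessageKind.ACTIVITY:
        return "fail_activity"
    # Entities are explicitly left unhandled in the description.
    return "unhandled"


def call_activity(poisoned: bool) -> str:
    # A poisoned activity is marked as failed, which raises in the caller.
    if poisoned:
        raise ActivityPoisonedError("activity message exceeded the dequeue threshold")
    return "activity result"


def orchestrator(poisoned: bool) -> str:
    # The failure is catchable, so the orchestrator can compensate or give up.
    try:
        return call_activity(poisoned)
    except ActivityPoisonedError:
        return "fallback result"
```

The asymmetry is the point of the design: an activity failure can be delegated to the calling orchestrator's error handling, while a poisoned orchestrator has no caller that could reliably recover, so it is terminated.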

davidmrdavid commented 2 months ago

As of the latest commit, poison activities are handled as well.

davidmrdavid commented 2 months ago

To figure out: what does an Entity poison message look like?

Azure / durabletask #1063