Poison messages are a rare but destructive scenario in which DTFx repeatedly attempts to process a message but never makes progress. This can destabilize the application and cause control queue backlogs to grow.
In those cases, we want to identify these "poison" messages and take them out of circulation by moving them to a "poison container", where the user can manually review and handle them, and by stopping further processing attempts.
This PR adds a simple poison message handling solution for orchestrator and activity messages. When an orchestrator or activity poison message is encountered (defined as having a DequeueCount larger than 20, or some user-configured value), we place it in a new Azure Storage table called `<taskhubName>-poison`, which holds poison messages, and immediately delete it from the queue. This table is created only on demand, when a poison message is first encountered.
From there, the consumer of that message is notified of the poison message.
In the case of an orchestrator poison message, the orchestrator is terminated.
In the case of an activity poison message, the activity is marked as failed, which in turn throws a catchable exception in the calling orchestrator.
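
To make the flow above concrete, here is a minimal in-memory sketch of it, written in Python purely for illustration. It is not the code in this PR and does not use any DTFx or Azure Storage API: `TaskHub`, `QueueMessage`, `MessageKind`, and every method name are hypothetical, and the queue and table are plain dictionaries standing in for the real storage.

```python
from dataclasses import dataclass, field
from enum import Enum

# Threshold taken from the description above; assumed to be user-configurable.
DEFAULT_MAX_DEQUEUE_COUNT = 20


class MessageKind(Enum):
    ORCHESTRATOR = "orchestrator"
    ACTIVITY = "activity"


@dataclass
class QueueMessage:
    # Hypothetical stand-in for a queue message; not a DTFx type.
    message_id: str
    instance_id: str
    kind: MessageKind
    dequeue_count: int


@dataclass
class TaskHub:
    # Hypothetical in-memory model of a task hub's queue and tables.
    name: str
    max_dequeue_count: int = DEFAULT_MAX_DEQUEUE_COUNT
    queue: dict = field(default_factory=dict)    # message_id -> QueueMessage
    tables: dict = field(default_factory=dict)   # table name -> list of rows
    terminated: set = field(default_factory=set)
    failed_activities: set = field(default_factory=set)

    def is_poison(self, msg: QueueMessage) -> bool:
        # "Larger than" the threshold, so a count equal to it is still processed.
        return msg.dequeue_count > self.max_dequeue_count

    def quarantine(self, msg: QueueMessage) -> None:
        # The poison table is created on demand, the first time it is needed.
        table = self.tables.setdefault(f"{self.name}-poison", [])
        table.append(msg)
        # Remove the message from the queue so it is not retried.
        self.queue.pop(msg.message_id, None)

    def handle(self, msg: QueueMessage) -> bool:
        """Return True if the message was treated as poison."""
        if not self.is_poison(msg):
            return False
        self.quarantine(msg)
        if msg.kind is MessageKind.ORCHESTRATOR:
            # Orchestrator poison message: terminate the orchestration.
            self.terminated.add(msg.instance_id)
        elif msg.kind is MessageKind.ACTIVITY:
            # Activity poison message: mark the activity as failed so the
            # calling orchestrator can observe and catch the failure.
            self.failed_activities.add(msg.instance_id)
        return True
```

In this sketch a message with a dequeue count of exactly 20 is processed normally and one with 21 is quarantined, which matches the "larger than 20" wording above; entity messages are deliberately left out, since their handling is still an open question.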
The case for a poison message in Entities is unhandled - I'd appreciate guidance on how we think that should be handled, if at all.