cgillum closed this issue 3 years ago
@cgillum: just wondering if this issue is still ongoing? I would like to prioritize it if it hasn't been resolved yet.
@davidmrdavid It is still happening for us, yes. We have this issue in production, where we get a rapid fire of events that we cannot avoid, since they are triggered by hardware we monitor. Every few days we have to connect to the {TaskHubName}Instances table and change the state of the stuck orchestrations from Pending to Terminated.
We have also tried detecting when an orchestration is stuck in Pending and calling TerminateAsync from code; however, GetStatusAsync still returns Pending even after several minutes of waiting.
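A minimal sketch of that detect-and-terminate attempt, assuming the Durable Functions IDurableOrchestrationClient API; the class name, instance ID, and grace period here are hypothetical, not the reporter's actual code:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class StuckInstanceMonitor
{
    // Illustrative sketch: terminate an instance that has stayed in the
    // Pending state for longer than a grace period.
    public static async Task TerminateIfStuckAsync(
        IDurableOrchestrationClient client,
        string instanceId,
        TimeSpan gracePeriod)
    {
        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);

        bool stuckInPending =
            status != null &&
            status.RuntimeStatus == OrchestrationRuntimeStatus.Pending &&
            DateTime.UtcNow - status.CreatedTime > gracePeriod;

        if (stuckInPending)
        {
            // As described in this thread, the terminate message itself can be
            // discarded by the race condition, so the status may stay Pending.
            await client.TerminateAsync(instanceId, "Stuck in Pending state");
        }
    }
}
```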
Hi @morphvale, sorry to hear that. Can you please create a new bug report issue in this repo and tag me on it? I've been looking at these "orchestration stuck" issues and, while we've made some progress, it seems some cases still escape us. That bug report would help us prioritize this, looking forward to it!
This should have been fixed in v2.5.0 with https://github.com/Azure/durabletask/pull/531.
Description
It is possible for an orchestration to get stuck in a Pending state permanently if it is started around the same time that termination or external event messages are delivered to the same instance. The conditions that trigger this problem are described in the low-level technical details below.
Low-level technical details
The bug is a race condition in the DurableTask.AzureStorage NuGet dependency. The "start" and "terminate" messages get processed together in a single batch. Because the existing instance is in a completed or terminated state, the system decides it needs to discard the terminate message. However, it discards both the terminate and the start messages, leaving the instance stuck in the Pending state indefinitely.
Note that it is still possible to hit this race condition even if the start and terminate messages aren't sent at the same time; what matters is that they are processed at the same time. For example, if the host is not running and the start and terminate messages are delivered far apart in time, they can still end up being processed in the same batch once the host starts running.
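A hedged sketch of client code that can land in this window, assuming the Durable Functions IDurableOrchestrationClient API; the class and orchestrator names are hypothetical, and the snippet is only meant to illustrate how a start and a terminate can end up in the same batch:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class RaceWindowExample
{
    // Illustrative sketch: a start and a terminate issued close together can be
    // picked up by the worker in a single batch, which is the window in which
    // the race condition described above can discard both messages.
    public static async Task StartAndTerminateAsync(
        IDurableOrchestrationClient client,
        string instanceId)
    {
        // "HardwareMonitorOrchestrator" is a hypothetical orchestrator name.
        await client.StartNewAsync("HardwareMonitorOrchestrator", instanceId);

        // Terminating right away makes it likely that both messages are
        // dequeued and processed together.
        await client.TerminateAsync(instanceId, "terminated immediately after start");
    }
}
```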
Expected behavior
Orchestrations should never get stuck in the Pending state permanently.
Actual behavior
The race condition mentioned previously results in a Pending orchestration that cannot be resumed because there is no start message to resume it.
Known workarounds
There are a couple of workarounds to consider:

- Manually edit the {TaskHubName}Instances table in Azure Storage to change the instance status from Pending to Terminated, then start the orchestration instance again from code (a sketch of this approach follows below).
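Below is a hedged sketch of that table-edit workaround, assuming the Azure.Data.Tables SDK and that the {TaskHubName}Instances table keeps the status in a RuntimeStatus column partitioned by instance ID; the connection string, class, and method names are placeholders, not an official tool:

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class PendingInstanceRepair
{
    // Illustrative sketch: flip a stuck instance's RuntimeStatus from Pending
    // to Terminated directly in the {TaskHubName}Instances table. Afterwards
    // the orchestration can be started again with a normal StartNewAsync call.
    public static async Task MarkPendingInstanceTerminatedAsync(
        string storageConnectionString, // placeholder connection string
        string taskHubName,
        string instanceId)
    {
        var tableClient = new TableClient(storageConnectionString, $"{taskHubName}Instances");

        // The Instances table is keyed by instance ID; query by PartitionKey
        // rather than assuming a particular RowKey layout.
        await foreach (TableEntity entity in
            tableClient.QueryAsync<TableEntity>(e => e.PartitionKey == instanceId))
        {
            if ((string)entity["RuntimeStatus"] == "Pending")
            {
                entity["RuntimeStatus"] = "Terminated";
                await tableClient.UpdateEntityAsync(entity, entity.ETag, TableUpdateMode.Merge);
            }
        }
    }
}
```

Once the status has been changed, the instance can be started again from code, as the workaround above describes.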