cgillum closed this issue 3 years ago
@cgillum: just wondering if this issue is still ongoing? I would like to prioritize it if it hasn't been resolved yet.
@davidmrdavid It is still happening for us, yes. We have this issue in production, where we get a rapid fire of events that we cannot avoid, since they are triggered by hardware we monitor. Every few days we have to connect to the {TaskHubName}Instances table and change the state of the stuck orchestrations from Pending to Terminated.
We have also tried detecting when an orchestration is stuck in Pending and calling TerminateAsync from code; however, GetStatusAsync still returns Pending even after several minutes of waiting.
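A minimal sketch of that detect-and-terminate attempt, assuming the Durable Functions IDurableOrchestrationClient API; the class name, instance ID, and grace period here are hypothetical, not the reporter's actual code:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class StuckInstanceMonitor
{
    // Illustrative sketch: terminate an instance that has stayed in the
    // Pending state for longer than a grace period.
    public static async Task TerminateIfStuckAsync(
        IDurableOrchestrationClient client,
        string instanceId,
        TimeSpan gracePeriod)
    {
        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);

        bool stuckInPending =
            status != null &&
            status.RuntimeStatus == OrchestrationRuntimeStatus.Pending &&
            DateTime.UtcNow - status.CreatedTime > gracePeriod;

        if (stuckInPending)
        {
            // As described in this thread, the terminate message itself can be
            // discarded by the race condition, so the status may stay Pending.
            await client.TerminateAsync(instanceId, "Stuck in Pending state");
        }
    }
}
```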
Hi @morphvale, sorry to hear that. Can you please create a new bug report issue in this repo and tag me on it? I've been looking at these "orchestration stuck" issues and, while we've made some progress, it seems some cases still escape us. That bug report would help us prioritize this, looking forward to it!
This should have been fixed in v2.5.0 with https://github.com/Azure/durabletask/pull/531.
Description
It is possible for an orchestration to get stuck in a Pending state permanently if it is started around the same time that termination or external event messages are delivered to the same instance. The conditions that trigger this problem are described in the low-level technical details below.
Low-level technical details
The bug is a race condition in the DurableTask.AzureStorage NuGet dependency. The "start" and "terminate" messages get processed together in a single batch. Because the existing instance is in a completed or terminated state, the system decides it needs to discard the terminate message. However, it discards both the terminate and the start messages, leaving the instance stuck in the Pending state indefinitely.
Note that it is still possible to hit this race condition even if the start and terminate messages aren't sent at the same time; what matters is that they are processed at the same time. For example, if the host is not running and the start and terminate messages are delivered far apart in time, they can still end up being processed in the same batch once the host starts running.
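A hedged sketch of client code that can land in this window, assuming the Durable Functions IDurableOrchestrationClient API; the class and orchestrator names are hypothetical, and the snippet is only meant to illustrate how a start and a terminate can end up in the same batch:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class RaceWindowExample
{
    // Illustrative sketch: a start and a terminate issued close together can be
    // picked up by the worker in a single batch, which is the window in which
    // the race condition described above can discard both messages.
    public static async Task StartAndTerminateAsync(
        IDurableOrchestrationClient client,
        string instanceId)
    {
        // "HardwareMonitorOrchestrator" is a hypothetical orchestrator name.
        await client.StartNewAsync("HardwareMonitorOrchestrator", instanceId);

        // Terminating right away makes it likely that both messages are
        // dequeued and processed together.
        await client.TerminateAsync(instanceId, "terminated immediately after start");
    }
}
```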
Expected behavior
Orchestrations should never get stuck in the Pending state permanently.
Actual behavior
The race condition mentioned previously results in a Pending orchestration that cannot be resumed because there is no start message to resume it.
Known workarounds
There are a couple of workarounds to consider:

- Manually edit the {TaskHubName}Instances table in Azure Storage to change the instance status from Pending to Terminated, then start the orchestration instance again from code (a sketch of this approach follows below).
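Below is a hedged sketch of that table-edit workaround, assuming the Azure.Data.Tables SDK and that the {TaskHubName}Instances table keeps the status in a RuntimeStatus column partitioned by instance ID; the connection string, class, and method names are placeholders, not an official tool:

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class PendingInstanceRepair
{
    // Illustrative sketch: flip a stuck instance's RuntimeStatus from Pending
    // to Terminated directly in the {TaskHubName}Instances table. Afterwards
    // the orchestration can be started again with a normal StartNewAsync call.
    public static async Task MarkPendingInstanceTerminatedAsync(
        string storageConnectionString, // placeholder connection string
        string taskHubName,
        string instanceId)
    {
        var tableClient = new TableClient(storageConnectionString, $"{taskHubName}Instances");

        // The Instances table is keyed by instance ID; query by PartitionKey
        // rather than assuming a particular RowKey layout.
        await foreach (TableEntity entity in
            tableClient.QueryAsync<TableEntity>(e => e.PartitionKey == instanceId))
        {
            if ((string)entity["RuntimeStatus"] == "Pending")
            {
                entity["RuntimeStatus"] = "Terminated";
                await tableClient.UpdateEntityAsync(entity, entity.ETag, TableUpdateMode.Merge);
            }
        }
    }
}
```

Once the status has been changed, the instance can be started again from code, as the workaround above describes.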