Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.47k stars 287 forks source link

Add orchestrator scheduled actions check when suspended #1024

Closed nytian closed 5 months ago

nytian commented 5 months ago

This PR fixes issue #2519

When calling SuspendAsync URL, condition can happen that the orchestrator is suspended when there is already an action scheduled and added to the orchestratorActionsMap. However, our design will add any later event into a temporary queue during suspended time. Thus, this scheduled action will never be removed or completed and will be executed endless times until orchestrator is resumed. And this execution history will also be added to the History Table which will cause the non-deterministic error when orchestrator is resumed.

The race condition occurs when an ExecutionSuspended message is processed after TaskCompleted. Additionally, there is a subsequent activity task pending scheduling after the completion of this task. During the processingTaskCompleted, the result is sent back to the extension level, and thus, durable extension will go to the next line and then schedule a new task, adds it to the action map which will cause the issue.

For debug, we can verify if a task has been executed multiple times by checking if columnTaskEventId in Kusto or TaskScheduledId in History Table. Each task should have a unique identifier in these fields.

The fix of this PR is to guarantee orchestratorActionsMap is empty when orchestrator is suspended. Basically, we will check the orchestratorActionsMap when suspending an orchestrator. If orchestratorActionsMap is not null, the action will be saved to a temporary dictionary and will be added back to the orchestratorActionsMap when the orchestrator is resumed.

davidmrdavid commented 5 months ago

I know there's a "stress tests" project here that I believe is meant to catch race conditions like the bug you're fixing. I wonder if you can re-use that infrastructure to set up a test for suspend/resume. Do you might giving that a look?

nytian commented 5 months ago

I sync'ed w/ Naiyuan and, after the fact, we determined that we could have identified this bug by paying attention to the TaskScheduledId column of our telemetry. So I don't strictly think we need new logs anymore, just better team-wide awareness to pay attention to that source of information. Still, please get @bachuv's approval as well before merging.

Finally, @nytian, you wrote the following:

During the processingTaskCompleted, the result is sent back to the extension level, where SuspendAsync call is impossible to prevent the triggering of the next line. Thus, the extension will schedule a new activity and adds it to the action map which will cause the issue

Assuming I'm reading this correctly, I'm not certain that this statement is 100% accurate. In particular, DTFx.Core and the Extension run in the same thread so it's not quite right to say that "it's impossible [for DTFx] to prevent the triggering of the next line [of DF Extension user-code]." This is definitely possible so I think the explanation needs to be tweaked slightly :) . Let's discuss this offline later, it's just a nitpick

@davidmrdavid . Thanks for pointing out! I will update this part.