Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Non-deterministic workflow failure from slot deployment #2696

Open mitchellrust opened 6 months ago

mitchellrust commented 6 months ago

Description

Microsoft Support Case Tracking ID: 2310310040011442

This was previously opened as bug #2635 but it was auto-closed.

We have seen occasional failures in our eternal durable orchestrations during deployment to our function apps, using slot deployments. This does not occur on every deployment, but we've seen it repeated multiple times.

The error message is as follows:

Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'DetectAndPublishSalesOrderEvents' (version ''), but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running?
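For readers unfamiliar with this error, the check behind it can be illustrated with a toy model. This is a minimal sketch, not the Durable Task Framework's actual implementation: the `TaskScheduled` record, the `replay` function, and the exception class are hypothetical names used only for illustration. It assumes replay compares each recorded "task scheduled" event against what the current code schedules at the same sequence ID.

```python
from dataclasses import dataclass


class NonDeterministicWorkflowError(Exception):
    """Raised when a replay does not match the recorded history (toy model)."""


@dataclass(frozen=True)
class TaskScheduled:
    """Hypothetical record of an activity scheduled by a previous execution."""
    sequence_id: int
    name: str


def replay(history, scheduled_by_code):
    """Check recorded events against the activities the current code schedules.

    history: TaskScheduled events recorded by a previous execution.
    scheduled_by_code: activity names the replaying code schedules, in order.
    """
    for event in history:
        # The recorded task has no counterpart in the current replay.
        if event.sequence_id >= len(scheduled_by_code):
            raise NonDeterministicWorkflowError(
                f"history has an activity with sequence ID {event.sequence_id} "
                f"and name '{event.name}', but the replay has not scheduled it"
            )
        # The replay scheduled a different activity at this position.
        actual = scheduled_by_code[event.sequence_id]
        if actual != event.name:
            raise NonDeterministicWorkflowError(
                f"sequence ID {event.sequence_id}: history has '{event.name}', "
                f"replay scheduled '{actual}'"
            )
    return True
```

In this model the error can appear without any code change: it only requires that the history being replayed and the code doing the replay do not belong together.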

Here is the scenario that we see this happening in:

Based on our scenario and what the logs have shown us, it seems that our stage slot is somehow resuming orchestrations, and naturally it can't find the history for those orchestrations because that history exists only in the production slot's task hub. From what we can tell, this is a nuance specific to slot deployments.

Expected behavior

A slot swap during deployment does not resume orchestrations and their related request-messages within the stage slot.

Actual behavior

After the slot swap has occurred during deployment, the stage slot resumes eternal orchestrations that had been running in the production slot prior to the slot deployment. As expected, the stage slot cannot find the orchestration history, and it fails with a "Non-Deterministic workflow detected" error.

Relevant source code snippets

N/A

Known workarounds

None. Microsoft support has told us to stop all running orchestrations before executing a slot deployment; however, this runs counter to our use of eternal orchestrations, so we do not consider it a workaround.

App Details

Screenshots

Production slot app configuration, showing the task hub sticky deployment setting. (image)

Stage slot app configuration, showing the task hub sticky deployment setting. (image)
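Since the screenshots are not reproduced here, the setup they describe can be sketched as follows. This is a minimal illustration under stated assumptions: `host.json` takes its task hub name from an app setting through the `%SettingName%` syntax, and that app setting is marked as a deployment slot (sticky) setting with a different value in each slot. The setting name `TaskHubName` and the hub values used with it are hypothetical; the issue does not state the real ones.

```python
import re

# Assumed shape of host.json: the hub name is read from an app setting.
# "TaskHubName" is a hypothetical setting name chosen for this sketch.
HOST_JSON = {
    "version": "2.0",
    "extensions": {"durableTask": {"hubName": "%TaskHubName%"}},
}


def resolve_hub_name(host_json, app_settings):
    """Return the task hub name, expanding a %Setting% reference if present."""
    raw = host_json["extensions"]["durableTask"]["hubName"]
    match = re.fullmatch(r"%(\w+)%", raw)
    if match is None:
        return raw  # literal hub name, no app setting involved
    return app_settings[match.group(1)]


def hubs_are_isolated(host_json, production_settings, stage_settings):
    """True when the two slots resolve to different task hubs.

    Names are compared case-insensitively as a conservative choice.
    """
    prod = resolve_hub_name(host_json, production_settings)
    stage = resolve_hub_name(host_json, stage_settings)
    return prod.casefold() != stage.casefold()
```

A check like this could run in a deployment pipeline before the swap, so that a misconfigured slot sharing the production task hub is caught early.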

If deployed to Azure

bachuv commented 6 months ago

Hi @mitchellrust, have you had a chance to review our Zero downtime deployment docs? They cover the scenarios in which using slots is recommended. If you have long-running orchestrations, you can try the application routing strategy. Let us know if you have any questions!

drdamour commented 6 months ago

@bachuv yes, we've read through that quite a bit. Does the application routing strategy apply to eternal orchestration singletons? It seems to apply to incoming requests that launch orchestrations, but that's not the case with eternal orchestrations; they just keep running. It also seems to focus on breaking changes, but in this case we are NOT deploying breaking changes, yet the eternal orchestration crashes.

bcrispcvna commented 5 months ago

We're running into the same issue while moving a function app from .NET 6 in-proc to .NET 8 isolated. We also have a stage/production slot and all of our in-flight eternal orchestrations will fail during the swap with a similar exception.

This is currently blocking us from moving forward with the upgrade. For now I'm going to try a workaround: disable our orchestration function(s) on the prod slot prior to the swap and re-enable them afterwards. It's not a preferred scenario, so hopefully this is something that can be addressed.

@mitchellrust @bachuv is this something that you've tried already?
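One way to script the disable and re-enable step described above is through the `AzureWebJobs.<FunctionName>.Disabled` app setting, set with `az functionapp config appsettings set`. The helper below is a minimal sketch that only builds the command line and does not run it. The app name, resource group, and function name in the usage lines are hypothetical placeholders, and whether this approach actually avoids the failure has not been confirmed in this thread.

```python
def disable_function_command(app_name, resource_group, function_name,
                             disabled, slot=None):
    """Build the az CLI argv that toggles a function's Disabled app setting.

    slot=None targets the production slot; pass a slot name to target
    a non-production slot instead.
    """
    value = "true" if disabled else "false"
    argv = [
        "az", "functionapp", "config", "appsettings", "set",
        "--name", app_name,
        "--resource-group", resource_group,
        "--settings", f"AzureWebJobs.{function_name}.Disabled={value}",
    ]
    if slot is not None:
        argv += ["--slot", slot]
    return argv


# Hypothetical names, for illustration only.
before_swap = disable_function_command("my-app", "my-rg", "MyOrchestrator", True)
after_swap = disable_function_command("my-app", "my-rg", "MyOrchestrator", False)
```

The returned lists can be passed to `subprocess.run(argv, check=True)` on a machine where the Azure CLI is installed and logged in.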

mitchellrust commented 5 months ago

@bcrispcvna this isn't something that we've tried; we've just been restarting any of the orchestrations that randomly fail when performing the slot swap. We would rather not build too much process around a workaround, in the hope that a proper fix can be put in place for the bug.

nytian commented 4 months ago

Hey @mitchellrust, if I understand correctly, the production slot and the stage slot each have their own Task Hub, which won't be swapped. I wonder, are there any orchestrator instances in the Pending or Running state in the stage slot's Task Hub when the swap happens? If there are, that would cause those instances to resume running. And if the orchestrator logic has changed, that would trigger the non-deterministic exception.
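To answer the question above, the stage slot's Task Hub can be queried for Pending or Running instances through the Durable Functions HTTP API before a swap. The helper below is a minimal sketch that only builds the request URL; the host name, task hub name, and key in the usage line are hypothetical placeholders.

```python
from urllib.parse import urlencode


def instances_query_url(base_url, task_hub, statuses, system_key):
    """Build a 'list instances' URL filtered by runtime status.

    statuses is joined with commas, e.g. ["Pending", "Running"].
    """
    query = urlencode(
        {
            "taskHub": task_hub,
            "runtimeStatus": ",".join(statuses),
            "code": system_key,
        },
        safe=",",  # keep the comma-separated status list readable
    )
    return f"{base_url.rstrip('/')}/runtime/webhooks/durabletask/instances?{query}"


# Hypothetical values, for illustration only.
url = instances_query_url(
    "https://example-app-stage.azurewebsites.net",
    "StageHub",
    ["Pending", "Running"],
    "SYSTEM_KEY",
)
```

An empty result from a GET on this URL would suggest that the stage Task Hub holds no instances that could resume after the swap.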

drdamour commented 4 months ago

@nytian there's no change in orchestrator logic