Non-deterministic workflow failure from slot deployment

mitchellrust commented 6 months ago

Description

Microsoft Support Case Tracking ID: 2310310040011442

This was previously opened as bug #2635 but it was auto-closed.

We have seen occasional failures in our eternal durable orchestrations during deployment to our function apps, using slot deployments. This does not occur on every deployment, but we've seen it repeated multiple times.

The error message is as follows:

Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'DetectAndPublishSalesOrderEvents' (version ''), but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running?
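
For readers unfamiliar with this message, here is a toy model of the kind of check behind it: on replay, what the orchestrator schedules is compared with the recorded history. This is only an illustrative sketch, not the Durable Task Framework's actual code. `TaskScheduled`, `NonDeterministicError`, and `replay` are hypothetical names, and the orchestrator is reduced to a function that returns activity names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScheduled:
    """Toy stand-in for a recorded 'activity scheduled' history event."""
    sequence_id: int
    name: str


class NonDeterministicError(Exception):
    """Raised when the replay does not match the recorded history."""


def replay(history, orchestrator):
    """Re-run the orchestrator and compare its schedule with the history.

    `orchestrator` is a plain function returning the activity names it
    schedules, in order (an assumption made to keep the model small).
    """
    scheduled = orchestrator()
    for event in history:
        # History says a task was scheduled that the replay never reached.
        if event.sequence_id >= len(scheduled):
            raise NonDeterministicError(
                f"history has task {event.sequence_id} '{event.name}', "
                "but the replay has not scheduled it"
            )
        # History and replay disagree about which activity this was.
        if scheduled[event.sequence_id] != event.name:
            raise NonDeterministicError(
                f"history has '{event.name}' at {event.sequence_id}, "
                f"replay scheduled '{scheduled[event.sequence_id]}'"
            )
    return scheduled
```

In this model, a history containing `TaskScheduled(0, "DetectAndPublishSalesOrderEvents")` replays cleanly against an orchestrator that schedules that activity first, and raises if the replaying code schedules nothing, which is the shape of the error above.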

Here is the scenario that we see this happening in:

- Our function app uses slot deployments to minimize downtime during deployments, with two slots: stage and production.
- During the deployments where this issue occurs, no changes to orchestrations or activities have been made.
- Each of our slots has its own task hub, with its own control queues and tables. Both hubs live in the same Azure Storage account. We've defined this in our deployment app settings for the production and stage slots individually, using the AzureFunctionsJobHost__extensions__durableTask__hubName sticky app setting for each.
- During a slot deployment, there could be any number of orchestrations in the middle of executing on our production slot, or waiting for scheduled activity tasks to complete before resuming. We use eternal orchestrations heavily in our app.
  - We have seen this failure exclusively when an orchestration is awaiting the completion of a scheduled activity task.
- During the "Slot Swap" step of our slot deployment, we have seen in logs that orchestrations awaiting activity completions get resumed in the stage slot. This doesn't make much sense to us, as we wouldn't expect any tasks to be scheduled on the stage slot's control queues, but perhaps something in the app-settings switch causes things to cross over here.
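
One way to sanity-check that two task hubs in the same storage account really are separate is to list the resource names each hub would use and confirm they don't overlap. The sketch below assumes the Azure Storage provider's usual naming conventions as I understand them (`<hub>-control-NN` and `<hub>-workitems` queues, `<Hub>History` and `<Hub>Instances` tables) and a partition count of 4. The hub names `SalesProd` and `SalesStage` and both helper functions are hypothetical.

```python
def task_hub_resources(hub_name: str, partition_count: int = 4) -> dict:
    """Return the queue and table names a task hub is assumed to use."""
    queue_prefix = hub_name.lower()  # Azure Storage queue names are lowercase
    return {
        "control_queues": [
            f"{queue_prefix}-control-{i:02d}" for i in range(partition_count)
        ],
        "work_item_queue": f"{queue_prefix}-workitems",
        "tables": [f"{hub_name}History", f"{hub_name}Instances"],
    }


def shared_resources(hub_a: str, hub_b: str) -> set:
    """Return resource names that both hubs would use (ideally empty)."""

    def flatten(resources: dict) -> set:
        names = set()
        for value in resources.values():
            names.update([value] if isinstance(value, str) else value)
        # Compare case-insensitively: table names are not case-sensitive.
        return {name.lower() for name in names}

    return flatten(task_hub_resources(hub_a)) & flatten(task_hub_resources(hub_b))


if __name__ == "__main__":
    # Hypothetical hub names for the production and stage slots.
    print(shared_resources("SalesProd", "SalesStage"))
```

An empty result only shows that the names are distinct. It says nothing about which slot's host is reading which hub during the swap, which is the open question in this issue.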

Based on our scenario and what the logs have shown us, it seems that our stage slot is somehow resuming orchestrations, and naturally can't find the history for them because it all exists in the production slot's task hub. From what we can tell, this is a nuance specific to slot deployments.

Expected behavior

A slot swap during deployment does not resume orchestrations and their related request-messages within the stage slot.

Actual behavior

After the slot swap has occurred during deployment, the stage slot resumes eternal orchestrations that had been running in the production slot prior to the slot deployment. As expected, the stage slot cannot find the orchestration history, and fails with a "Non-Deterministic workflow detected" error.

Relevant source code snippets

N/A

Known workarounds

None. Microsoft support has told us to stop all running orchestrations before executing a slot deployment. However, this runs counter to our use of eternal orchestrations, so we do not consider it a workaround.

App Details

Durable Functions extension version (e.g. v1.8.3): Microsoft.Azure.WebJobs.Extensions.DurableTask v2.9.0
Azure Functions runtime version (1.0 or 2.0): Runtime Version v4.27.5.21549
Programming language used: C# Dotnet 6.0

Screenshots

production slot app configuration, showing task hub sticky deployment setting.

stage slot app configuration, showing task hub sticky deployment setting.

If deployed to Azure

Timeframe issue observed: 2023-12-07T21:06:15.4034258Z to 2023-12-07T21:08:21.3668316Z
Function App name: gsp01fap-jde-na-ext-prod
Function name(s): MonitorSalesOrderChanges (orchestration), DetectAndPublishSalesOrderEvents (activity)
Azure region: West US
Orchestration instance ID(s): monitor$MonitorSalesOrderChanges$2a6408375de940aaa756f613c6ecd42e
Azure storage account name: gsp01strjdenaext

bachuv commented 6 months ago

Hi @mitchellrust, have you had a chance to review our Zero downtime deployment docs? It covers scenarios when using slots is recommended. If you have long running orchestrations, you can try the application routing strategy. Let us know if you have any questions!

drdamour commented 6 months ago

@bachuv yes, we've read through that quite a bit. Does the application routing strategy apply to eternal orchestration singletons? It seems to apply to incoming requests that launch orchestrations, but that's not the case with eternal orchestrations; they just keep running. It also seems to focus on breaking changes, but in this case we are NOT deploying breaking changes, yet the eternal orchestration crashes.

bcrispcvna commented 5 months ago

We're running into the same issue while moving a function app from .NET 6 in-proc to .NET 8 isolated. We also have a stage/production slot and all of our in-flight eternal orchestrations will fail during the swap with a similar exception.

This is currently blocking us from moving forward with the upgrade. I'm going to try a workaround for now: disabling our orchestration function(s) on the prod slot prior to the swap and re-enabling them afterwards. It's not a preferred scenario, so hopefully this is something that can be addressed.

@mitchellrust @bachuv is this something that you've tried already?

mitchellrust commented 5 months ago

@bcrispcvna this isn't something that we've tried, we've just been restarting any of the orchestrations that randomly fail when performing the slot swap. We are hoping to not build too much process around a workaround in hopes that a proper fix can be put in place for the bug.

nytian commented 4 months ago

Hey @mitchellrust, if I understand correctly, the production slot and the stage slot each have their own Task Hub, which won't be swapped. I wonder, were there any orchestrator instances in the Pending or Running state in the stage slot's Task Hub when the swap happened? If there were, those instances would resume running. And if the orchestrator logic has changed, that would trigger the non-deterministic exception.
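
One way to answer that question is the Durable Functions HTTP API, which can list instances filtered by runtime status for a given task hub. Here is a small sketch that only builds the query URL. The base URL, hub name, and key are placeholders, `instances_query_url` is a hypothetical helper, and the query parameters (`taskHub`, `runtimeStatus`, `code`) follow the documented "get all instances" endpoint as I understand it.

```python
from urllib.parse import urlencode


def instances_query_url(base_url, task_hub, system_key,
                        statuses=("Pending", "Running")):
    """Build a URL listing instances in `task_hub` with the given statuses."""
    query = urlencode(
        {
            "taskHub": task_hub,
            # Multiple statuses are passed as one comma-separated value.
            "runtimeStatus": ",".join(statuses),
            "code": system_key,
        },
        safe=",",  # keep the commas between statuses unescaped
    )
    return f"{base_url.rstrip('/')}/runtime/webhooks/durabletask/instances?{query}"


if __name__ == "__main__":
    # Placeholder host, hub name, and key; substitute your stage slot's values.
    print(instances_query_url(
        "https://example-stage.azurewebsites.net", "SalesStage", "KEY"))
```

Issuing a GET against that URL just before the swap should show whether the stage hub holds any Pending or Running instances.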

drdamour commented 4 months ago

@nytian there's no change in orchestrator logic

Azure / azure-functions-durable-extension