Open mitchellrust opened 6 months ago
Hi @mitchellrust, have you had a chance to review our Zero downtime deployment docs? They cover the scenarios in which using slots is recommended. If you have long-running orchestrations, you can try the application routing strategy. Let us know if you have any questions!
@bachuv yes, we've read through that quite a bit. Does the application routing strategy apply to eternal orchestration singletons? It seems to apply to incoming requests to launch orchestrations, but that's not the case with eternal orchestrations; they just keep running. It also seems to focus on breaking changes, but in this case we are NOT deploying breaking changes, yet the eternal orchestration crashes.
We're running into the same issue while moving a function app from .NET 6 in-proc to .NET 8 isolated. We also have a stage/production slot and all of our in-flight eternal orchestrations will fail during the swap with a similar exception.
This is currently blocking us from moving forward with the upgrade. For now I'm going to try a workaround: disabling our orchestration function(s) on the prod slot prior to the swap and re-enabling them afterwards. It's not a preferred scenario, so hopefully this is something that can be addressed.
@mitchellrust @bachuv is this something that you've tried already?
@bcrispcvna this isn't something that we've tried; we've just been restarting any of the orchestrations that randomly fail when performing the slot swap. We'd rather not build too much process around a workaround, in hopes that a proper fix can be put in place for the bug.
Hey @mitchellrust, if I understand correctly, the production slot and the stage slot each have their own task hub, which won't be swapped. I wonder: were there any orchestrator instances in the Pending or Running state in the stage slot's task hub when the swap happened? If there were, the swap would cause those instances to resume running, and if the orchestrator logic had changed, that would trigger the non-deterministic exception.
@nytian there's no change in orchestrator logic.
Description
Microsoft Support Case Tracking ID: 2310310040011442
This was previously opened as bug #2635 but it was auto-closed.
We have seen occasional failures in our eternal durable orchestrations during deployment to our function apps, using slot deployments. This does not occur on every deployment, but we've seen it repeated multiple times.
The error message is as follows:
Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'DetectAndPublishSalesOrderEvents' (version ''), but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running?
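As a rough illustration of what that message describes, here is a toy model of the replay check. It is not the Durable Task Framework's actual implementation, and every name in it (`TaskScheduled`, `check_replay`, `NonDeterministicWorkflowError`) is hypothetical. It only shows the general idea: the recorded history contains a scheduled activity that the replaying code has not produced.

```python
from dataclasses import dataclass


class NonDeterministicWorkflowError(Exception):
    """Raised when a replay does not match the recorded history (toy model)."""


@dataclass(frozen=True)
class TaskScheduled:
    """Hypothetical stand-in for a 'task scheduled' history event."""
    sequence_id: int
    name: str


def check_replay(history, replayed):
    """Verify that every recorded task was also scheduled by the replay."""
    scheduled = {task.sequence_id: task.name for task in replayed}
    for event in history:
        if scheduled.get(event.sequence_id) != event.name:
            raise NonDeterministicWorkflowError(
                f"history has activity {event.name!r} with sequence ID "
                f"{event.sequence_id}, but the replay has not scheduled it"
            )


history = [TaskScheduled(0, "DetectAndPublishSalesOrderEvents")]

# Replay that schedules the same activity: the check passes silently.
check_replay(history, [TaskScheduled(0, "DetectAndPublishSalesOrderEvents")])

# Replay that schedules nothing: the mismatch is reported.
try:
    check_replay(history, [])
except NonDeterministicWorkflowError as err:
    print(err)
```

In this model a mismatch can come from changed orchestrator code, but equally from a history and a replay that do not belong together, which is closer to what we suspect here.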
Here is the scenario that we see this happening in:
- [...] `stage` and `production`.
- [...] `production` and `stage` slots individually, using the `AzureFunctionsJobHost__extensions__durableTask__hubName` sticky app setting for each.
- [...] `production` slot, or waiting for scheduled activity tasks to complete before resuming. We utilize eternal orchestrations heavily for our app.
- [...] `stage` slot. This doesn't make a ton of sense, as we wouldn't expect any tasks to be scheduled on the `stage` slot control queues, but perhaps something in the app-settings switch causes things to cross over here.
Based on our scenario and what the logs have shown us, it seems that our `stage` slot is resuming orchestrations somehow, and naturally can't find the history for the orchestration because that all exists in the `production` slot task hub. This seems to be a nuance specific to using slot deployments, from what we can tell.
Expected behavior
A slot swap during deployment does not resume orchestrations and their related request-messages within the `stage` slot.
Actual behavior
After the slot swap has occurred during deployment, the `stage` slot resumes eternal orchestrations that had been running in the `production` slot prior to the slot deployment. As expected, the `stage` slot cannot find the orchestration history, and fails with a `Non-Deterministic workflow detected` error.
Relevant source code snippets
N/A
Known workarounds
None. Microsoft support has communicated that we should stop all running orchestrations before executing a slot deployment, however this is counter to our use of eternal orchestrations and is not deemed a workaround.
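Short of stopping everything, one thing that can at least be checked before a swap is whether a slot's task hub still holds in-flight instances. The following is a minimal sketch that only builds the query URL, assuming the instance query endpoint of the Durable Functions HTTP API with its `taskHub`, `runtimeStatus` and `code` parameters. The host name, task hub name and key below are placeholders, not values from our app.

```python
from urllib.parse import urlencode


def instances_query_url(host, task_hub, system_key, statuses=("Running", "Pending")):
    """Build a URL that lists instances of one task hub in the given states.

    Assumes the Durable Functions HTTP API instance query endpoint;
    all argument values used below are hypothetical placeholders.
    """
    query = urlencode({
        "taskHub": task_hub,
        "runtimeStatus": ",".join(statuses),
        "code": system_key,
    })
    return f"https://{host}/runtime/webhooks/durabletask/instances?{query}"


# Placeholder values for illustration only.
print(instances_query_url("example-app.azurewebsites.net", "ExampleStageHub", "<system-key>"))
```

Sending a GET request to that URL for each slot's hub would show whether anything is Running or Pending there at swap time.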
App Details
Screenshots
`production` slot app configuration, showing task hub sticky deployment setting.
`stage` slot app configuration, showing task hub sticky deployment setting.
If deployed to Azure
- Timeframe issue observed: `2023-12-07T21:06:15.4034258Z` to `2023-12-07T21:08:21.3668316Z`
- Function App name: `gsp01fap-jde-na-ext-prod`
- Function name(s): `MonitorSalesOrderChanges` (orchestration), `DetectAndPublishSalesOrderEvents` (activity)
- Azure region: West US
- Orchestration instance ID(s): `monitor$MonitorSalesOrderChanges$2a6408375de940aaa756f613c6ecd42e`
- Azure storage account name: `gsp01strjdenaext`