Open SimonLuckenuik opened 5 years ago
I have mostly the same sentiments as @SimonLuckenuik
Something else to add: when using the WaitForExternalEvent
API, versioning becomes even harder, because the external system that raises the event must also change, since it can't keep directing events to the old orchestrations. That likely means the external system needs to hit a new endpoint on the Function App. Obviously not an impossible situation, just more versioning toil that ripples beyond the bounds of the DF app.
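To make the event-routing problem concrete, here is a minimal sketch (plain Python, not the Durable Functions SDK) of what the external system ends up having to track: which app version started each orchestration instance, so that a later event can be raised against the matching endpoint. The endpoint URLs and function names here are hypothetical.

```python
# Hypothetical mapping from app version to the endpoint hosting it.
VERSION_ENDPOINTS = {
    "v1": "https://myapp-v1.azurewebsites.net/runtime/webhooks/durabletask",
    "v2": "https://myapp-v2.azurewebsites.net/runtime/webhooks/durabletask",
}

# The external system must remember which version each instance started on.
instance_versions = {}

def start_orchestration(instance_id: str, version: str) -> str:
    """Record the version an instance was started against."""
    instance_versions[instance_id] = version
    return VERSION_ENDPOINTS[version]

def endpoint_for_event(instance_id: str) -> str:
    """Pick the endpoint that hosts the (possibly old) orchestration."""
    return VERSION_ENDPOINTS[instance_versions[instance_id]]

start_orchestration("order-123", "v1")  # started before the v2 deployment
start_orchestration("order-456", "v2")  # started after
```

The point is that this bookkeeping lives *outside* the Durable Functions app, which is exactly the ripple effect described above.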
I can also concur that side-by-side is the most viable approach if you are running critical systems where orchestrations can't be dropped mid-flight, or where orchestrations live longer than a few seconds.
Follow-up on issue https://github.com/MicrosoftDocs/azure-docs/issues/9152. I would be interested in comments from other developers about how they version their orchestrators, to see what people are actually doing vs. the recommendations in that document.
Thanks @cgillum for the changes. I want to add some more comments based on my usage of Durable Functions in production. Keep in mind that we have some long-running orchestrators (months) and a few short-lived workflows (minutes/seconds), with multiple orchestrators per Function app. The scenarios we run inside Durable Functions feel appropriate for the technology.
I also think that providing insufficient guidance might lead people to consider the technology complicated, and confine usage of Durable Functions to simple scenarios where versioning doesn't matter at all (very short-lived orchestrations). We also need to keep in mind that most people won't even think about possible replay errors; there are all kinds of programmers out there, and many won't necessarily dig into the internal workings of the technology or think about versioning at all.

Durable Functions is tricky: it's easy to get started with, which gives you a lot of confidence in building your scenarios, but it's also easy to break in ways you won't notice until it explodes in production. Anyone modifying code needs to realize when they are inside a piece of orchestrator code, to avoid breaking changes. Your QA team also needs to be aware of these version-change scenarios and ideally test them.
So, I think the first two mitigation strategies are hard to use if the workflows matter at all, and in my view are only good for development-only scenarios:
Do nothing: if you started those workflows and they are still running, chances are you care about them. Taking this approach would mean keeping a copy of your workflow's state outside the workflow, so that you can restart failed instances where they left off. It would also mean knowing which in-flight instances are currently running and subject to failure. I guess we could modify the orchestrator to catch the non-determinism exception thrown when replay reaches a change in logic, but what do we do after that to resume the workflow properly?
Stop all in-flight instances: again, if you care about your workflows, that doesn't fit the bill. If you care about them and restart them, you will need to keep your state outside the workflow. This also means your CI/CD pipeline needs to be aware of the breaking change and stop instances accordingly.
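Both of these strategies hinge on "keeping state outside your workflow". A minimal sketch of what that could look like (plain Python, hypothetical names, with a dict standing in for a real external store such as a database table): activities record their progress externally, so a stopped or failed instance can be restarted from the step after the last one that completed.

```python
# External checkpoint store (a real system would use a database).
checkpoints = {}  # instance_id -> last completed step

# The ordered steps of a hypothetical workflow.
ALL_STEPS = ["validate", "charge", "ship", "notify"]

def record_step(instance_id: str, step: str) -> None:
    """Called by each activity after it completes successfully."""
    checkpoints[instance_id] = step

def remaining_steps(instance_id: str) -> list:
    """Compute which steps a restarted instance still has to run."""
    done = checkpoints.get(instance_id)
    if done is None:
        return list(ALL_STEPS)
    return ALL_STEPS[ALL_STEPS.index(done) + 1:]

record_step("order-123", "charge")  # instance was stopped after charging
```

This is exactly the extra machinery the "do nothing" and "stop all instances" strategies quietly require you to build.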
Side-by-side deployments:
I will focus on side-by-side, but only on the "Deploy all the updates as entirely new functions, leaving existing functions as-is" scenario, because I don't see how to apply the others without too much overhead or without losing in-flight instances.
While I understand that side-by-side is the least risky way of handling versioning, it also imposes a way of organizing the code and the logic. If the business logic is lightweight, that can do the trick, but for complex/long workflows it is not necessarily appropriate, unless you are comfortable doubling the code footprint of a scenario simply to change a zero to a one in your logic, for example.
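A minimal sketch of that "deploy updates as entirely new functions" shape (plain Python with a hypothetical registry, not the Durable Functions API): each breaking change registers a new orchestrator name, new instances target the latest version, and the old function is kept around solely for in-flight instances.

```python
# Hypothetical registry of orchestrator functions by versioned name.
orchestrators = {}

def register(name):
    def wrap(fn):
        orchestrators[name] = fn
        return fn
    return wrap

@register("ProcessOrder_V1")
def process_order_v1(order):
    return order["amount"] * 0  # old logic, kept alive for in-flight instances

@register("ProcessOrder_V2")
def process_order_v2(order):
    return order["amount"] * 1  # the "zero to a one" change, deployed side by side

LATEST = "ProcessOrder_V2"

def start_new_instance(order):
    # New instances always target the latest version; old instances
    # keep replaying against the name they were started with.
    return orchestrators[LATEST](order)
```

Note how a one-character logic change still forces a whole duplicated orchestrator, which is the code-footprint concern raised above.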
When doing side-by-side, you will want to reuse part of the business logic in the new version of the big orchestrator, so you will extract part of the logic from the previous version; but that refactoring increases the chances of modifying the previous version without realizing it. Believe me, these are hard-to-catch bugs, because the code will explode only the next time the now-broken orchestrator replays, which could be weeks after the code change, and by then it's too late because new instances are using the same orchestrator code as well.
Currently, the strategy we use to remove almost all the risk of breaking the orchestrator is to delegate most business logic and decision-making to activities, so that the business logic results are persisted; we do this mainly because the smallest change to any logic in the orchestrator will break it. So in the end we have "dumb" orchestrators that simply chain activities sequentially, passing the result of the previous activity to the next, with almost no logic outside the classic patterns (fan-out/fan-in) when required.
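The "dumb orchestrator" shape described above can be sketched as follows (plain Python standing in for the orchestration runtime; in real Durable Functions each activity call would be awaited/yielded and its result persisted in the history):

```python
# Activities: all decision-making lives here, so results are persisted
# and surviving a redeploy does not depend on orchestrator logic.
def validate_order(order):
    return {**order, "valid": order["amount"] > 0}

def compute_total(order):
    return {**order, "total": order["amount"] * order.get("quantity", 1)}

def format_receipt(order):
    return f"order {order['id']}: total={order['total']}"

ACTIVITIES = [validate_order, compute_total, format_receipt]

def dumb_orchestrator(order):
    """Sequentially chain activities, passing each result to the next,
    with no business logic of its own."""
    result = order
    for activity in ACTIVITIES:
        result = activity(result)  # in DF this would be a yielded activity call
    return result
```

Because the orchestrator body contains no decisions, most code changes land in activities, whose outputs are already checkpointed, so replay stays deterministic.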
This also forces us to always use activity-specific complex types as input and output, so that we have some kind of control and can leverage persistence data contracts when the output of an activity changes over time. The fact that changing the return type of an activity is a breaking change is not always obvious to developers.
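One way to sketch that data-contract idea (illustrative Python, hypothetical field names): wrap each activity's output in a versioned complex type and give newly added fields defaults, so payloads persisted by an older version still deserialize during replay instead of breaking the orchestration.

```python
import json
from dataclasses import dataclass

@dataclass
class ChargeResult:
    """Activity-specific output type acting as a persistence data contract."""
    amount: float
    currency: str = "USD"  # field added in a later version; defaulted so
                           # payloads persisted before it existed still load

    @classmethod
    def from_json(cls, payload: str) -> "ChargeResult":
        return cls(**json.loads(payload))

# Payload persisted in the history before 'currency' existed.
old_payload = '{"amount": 9.5}'
result = ChargeResult.from_json(old_payload)
```

Returning a bare float instead would make the later addition of `currency` a breaking change; the wrapper type gives the contract room to evolve.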
I think it would also be great to have some kind of step-by-step guide: "How to modify your code when working with orchestrators".