Open SimonLuckenuik opened 5 years ago
I have mostly the same sentiments as @SimonLuckenuik
Something else to add: when using the WaitForExternalEvent
API, versioning becomes even harder, because the external system that raises the event must also change, since it can't keep directing events to the old orchestrations. That likely means the external system needs to hit a new endpoint on the Function App. Obviously not an impossible situation, just more versioning toil that ripples beyond the bounds of the DF app.
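To make the event-routing problem concrete, here is a minimal sketch (plain Python, not the Durable Functions SDK) of what the external system ends up having to track: which app version started each orchestration instance, so that a later event can be raised against the matching endpoint. The endpoint URLs and function names here are hypothetical.

```python
# Hypothetical mapping from app version to the endpoint hosting it.
VERSION_ENDPOINTS = {
    "v1": "https://myapp-v1.azurewebsites.net/runtime/webhooks/durabletask",
    "v2": "https://myapp-v2.azurewebsites.net/runtime/webhooks/durabletask",
}

# The external system must remember which version each instance started on.
instance_versions = {}

def start_orchestration(instance_id: str, version: str) -> str:
    """Record the version an instance was started against."""
    instance_versions[instance_id] = version
    return VERSION_ENDPOINTS[version]

def endpoint_for_event(instance_id: str) -> str:
    """Pick the endpoint that hosts the (possibly old) orchestration."""
    return VERSION_ENDPOINTS[instance_versions[instance_id]]

start_orchestration("order-123", "v1")  # started before the v2 deployment
start_orchestration("order-456", "v2")  # started after
```

The point is that this bookkeeping lives *outside* the Durable Functions app, which is exactly the ripple effect described above.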
I can also concur that side-by-side is the most viable approach if you are running critical systems where orchestrations can't be dropped mid-flight, or where orchestrations live longer than a few seconds.
Follow-up on issue https://github.com/MicrosoftDocs/azure-docs/issues/9152. I would be interested in comments from other developers about how they version their orchestrators, to see what people are actually doing vs. the recommendations in that document.
Thanks @cgillum for the changes. I want to add some more comments based on my usage of Durable Functions in production. Keep in mind that we have some long-running orchestrators (months) and a few short-lived workflows (minutes/seconds), with multiple orchestrators per Function app. The scenarios we run inside Durable Functions feel appropriate for the technology.
I also think that providing insufficient guidance might lead people to consider the technology complicated, and confine usage of Durable Functions to simple scenarios where versioning doesn't matter at all (very short-lived orchestrations). We also need to keep in mind that most people won't even think about possible replay errors; there are all kinds of programmers out there, and many won't necessarily dig into the internal workings of the technology or think about versioning at all.

Durable Functions is tricky: it's easy to get started with, which gives you a lot of confidence in building your scenarios, but it's also easy to break in ways you won't notice until it explodes in production. Anyone modifying code needs to realize when they are inside a piece of orchestrator code, to avoid breaking changes. Your QA team also needs to be aware of these version-change scenarios and ideally test them.
So, I think the first two mitigation strategies are hard to use if the workflows matter at all, and in my view are only good for development-only scenarios:
Do nothing: if you started those workflows and they are still running, chances are you care about them. Taking this approach would mean keeping a copy of your workflow's state outside the workflow, so that you can restart failed instances where they left off. It would also mean knowing which in-flight instances are currently running and subject to failure. I guess we could modify the orchestrator to catch the non-determinism exception thrown when replay reaches a change in logic, but what do we do after that to resume the workflow properly?
Stop all in-flight instances: again, if you care about your workflows, that doesn't fit the bill. If you care about them and restart them, you will need to keep your state outside the workflow. This also means your CI/CD pipeline needs to be aware of the breaking change and stop instances accordingly.
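Both of these strategies hinge on "keeping state outside your workflow". A minimal sketch of what that could look like (plain Python, hypothetical names, with a dict standing in for a real external store such as a database table): activities record their progress externally, so a stopped or failed instance can be restarted from the step after the last one that completed.

```python
# External checkpoint store (a real system would use a database).
checkpoints = {}  # instance_id -> last completed step

# The ordered steps of a hypothetical workflow.
ALL_STEPS = ["validate", "charge", "ship", "notify"]

def record_step(instance_id: str, step: str) -> None:
    """Called by each activity after it completes successfully."""
    checkpoints[instance_id] = step

def remaining_steps(instance_id: str) -> list:
    """Compute which steps a restarted instance still has to run."""
    done = checkpoints.get(instance_id)
    if done is None:
        return list(ALL_STEPS)
    return ALL_STEPS[ALL_STEPS.index(done) + 1:]

record_step("order-123", "charge")  # instance was stopped after charging
```

This is exactly the extra machinery the "do nothing" and "stop all instances" strategies quietly require you to build.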
Side-by-side deployments:
I will focus on side-by-side, but only on the "Deploy all the updates as entirely new functions, leaving existing functions as-is" scenario, because I don't see how to apply the others without too much overhead or without losing in-flight instances.
While I understand that side-by-side is the least risky way of handling versioning, it also imposes a way of organizing the code and the logic. If the business logic is lightweight, that can do the trick, but for complex/long workflows it is not necessarily appropriate, unless you are comfortable doubling the code footprint of a scenario simply to change a zero to a one in your logic, for example.
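A minimal sketch of that "deploy updates as entirely new functions" shape (plain Python with a hypothetical registry, not the Durable Functions API): each breaking change registers a new orchestrator name, new instances target the latest version, and the old function is kept around solely for in-flight instances.

```python
# Hypothetical registry of orchestrator functions by versioned name.
orchestrators = {}

def register(name):
    def wrap(fn):
        orchestrators[name] = fn
        return fn
    return wrap

@register("ProcessOrder_V1")
def process_order_v1(order):
    return order["amount"] * 0  # old logic, kept alive for in-flight instances

@register("ProcessOrder_V2")
def process_order_v2(order):
    return order["amount"] * 1  # the "zero to a one" change, deployed side by side

LATEST = "ProcessOrder_V2"

def start_new_instance(order):
    # New instances always target the latest version; old instances
    # keep replaying against the name they were started with.
    return orchestrators[LATEST](order)
```

Note how a one-character logic change still forces a whole duplicated orchestrator, which is the code-footprint concern raised above.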
When doing side-by-side, you will want to reuse part of the business logic in the new version of the big orchestrator, so you will extract part of the logic from the previous version; but that refactoring increases the chances of modifying the previous version without realizing it. Believe me, these are hard-to-catch bugs, because the code will explode only the next time the now-broken orchestrator replays, which could be weeks after the code change, and by then it's too late because new instances are using the same orchestrator code as well.
Currently, the strategy we use to remove almost all the risk of breaking the orchestrator is to delegate most business logic and decision-making to activities, so that the business logic results are persisted; we do this mainly because the smallest change to any logic in the orchestrator will break it. So in the end we have "dumb" orchestrators that simply chain activities sequentially, passing the result of the previous activity to the next, with almost no logic outside the classic patterns (fan-out/fan-in) when required.
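The "dumb orchestrator" shape described above can be sketched as follows (plain Python standing in for the orchestration runtime; in real Durable Functions each activity call would be awaited/yielded and its result persisted in the history):

```python
# Activities: all decision-making lives here, so results are persisted
# and surviving a redeploy does not depend on orchestrator logic.
def validate_order(order):
    return {**order, "valid": order["amount"] > 0}

def compute_total(order):
    return {**order, "total": order["amount"] * order.get("quantity", 1)}

def format_receipt(order):
    return f"order {order['id']}: total={order['total']}"

ACTIVITIES = [validate_order, compute_total, format_receipt]

def dumb_orchestrator(order):
    """Sequentially chain activities, passing each result to the next,
    with no business logic of its own."""
    result = order
    for activity in ACTIVITIES:
        result = activity(result)  # in DF this would be a yielded activity call
    return result
```

Because the orchestrator body contains no decisions, most code changes land in activities, whose outputs are already checkpointed, so replay stays deterministic.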
This also forces us to always use activity-specific complex types as input and output, so that we have some kind of control and can leverage persistence data contracts when the output of an activity changes over time. The fact that changing the return type of an activity is a breaking change is not always obvious to developers.
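One way to sketch that data-contract idea (illustrative Python, hypothetical field names): wrap each activity's output in a versioned complex type and give newly added fields defaults, so payloads persisted by an older version still deserialize during replay instead of breaking the orchestration.

```python
import json
from dataclasses import dataclass

@dataclass
class ChargeResult:
    """Activity-specific output type acting as a persistence data contract."""
    amount: float
    currency: str = "USD"  # field added in a later version; defaulted so
                           # payloads persisted before it existed still load

    @classmethod
    def from_json(cls, payload: str) -> "ChargeResult":
        return cls(**json.loads(payload))

# Payload persisted in the history before 'currency' existed.
old_payload = '{"amount": 9.5}'
result = ChargeResult.from_json(old_payload)
```

Returning a bare float instead would make the later addition of `currency` a breaking change; the wrapper type gives the contract room to evolve.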
I think it would also be great to have some kind of step-by-step guide: "How to modify your code when working with orchestrators".