Open SimonLuckenuik opened 6 years ago
Related documentation issue I just raised: https://github.com/MicrosoftDocs/azure-docs/issues/9152
Is it safe to summarize this ask as a detailed "step-by-step guide for side-by-side deployments"? @TsuyoshiUshio this might be a good topic for you.
I have a great interest in this topic as well since I want to explain this to clients. @TsuyoshiUshio please let me know if I can help with writing or coding concepts. Your blog is already a good starting point I think.
@cgillum, that sounds like what I am looking for! I am also looking for "Dos and Don'ts / Best Practices / Common Pitfalls" to minimize the impact of those versions or to prevent breaking changes completely. Example: if activity inputs and outputs are complex entities instead of value types, it is easier to avoid breaking changes caused by signature changes.
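A minimal sketch of the complex-type idea above, assuming activity inputs are persisted as JSON. `OrderRequest`, its properties, and `ActivityInputDemo` are hypothetical names, and `System.Text.Json` is used only to keep the example self-contained (the extension itself may use a different serializer). It shows that a payload recorded before the `Priority` property existed can still be read by the newer type:

```csharp
using System;
using System.Text.Json;

// Hypothetical activity input type. "Version 2" added Priority with a default,
// so payloads recorded by "version 1" (OrderId only) remain readable.
public class OrderRequest
{
    public string OrderId { get; set; }
    public int Priority { get; set; } = 1; // added later; default applies when absent
}

public static class ActivityInputDemo
{
    // Reads a stored payload back into the current version of the input type.
    public static OrderRequest ReadInput(string json)
    {
        return JsonSerializer.Deserialize<OrderRequest>(json);
    }
}
```

Adding an optional property to a class like this is usually a compatible change, whereas changing a value-type parameter (for example from `string` to `int`) changes the signature itself.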
Good point on best practices (or "tips and tricks" as I sometimes like to think of them). I agree that we need all of this.
Related documentation issue: https://github.com/Azure/azure-functions-durable-extension/issues/184
Hi @marcduiker @SimonLuckenuik ,
I've done two hackfests related to this topic. I'd be happy to contribute by writing Durable Functions DevOps documentation.
So far I have written one blog post. The solution it describes, which uses Event Grid notifications, is suitable for customers with long-running process scenarios.
Event Grid Publishing https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-event-publishing
Safe blue-green deployment with durable functions https://medium.com/@tsuyoshiushio/safe-blue-green-deployment-with-durable-functions-905a1cda0450
Also, I've submitted a PR that lets us query the status of all instances. It helps to check whether any instances are still running before a deployment. This feature is going to be merged in the next release.
https://github.com/Azure/azure-functions-durable-extension/pull/323
The usage is something like this.
    [FunctionName("GetAllStatus")]
    public static async Task Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "get", "post")] HttpRequestMessage req,
        [OrchestrationClient] DurableOrchestrationClient client,
        TraceWriter log)
    {
        var statuses = await client.GetStatusAsync(); // You can pass a CancellationToken as a parameter.
        // Do something based on the returned statuses.
    }
For Functions 1.0, the request format is as follows:
GET /admin/extensions/DurableTaskExtension/instances/?taskHub={taskHub}&connection={connection}&code={systemKey}
The Functions 2.0 format has all the same parameters but a slightly different URL prefix:
GET /runtime/webhooks/DurableTaskExtension/instances/?taskHub={taskHub}&connection={connection}&code={systemKey}&showHistory={showHistory}&showHistoryOutput={showHistoryOutput}
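As a rough sketch of how a deployment script might build these URLs, the helper below switches only the path prefix between the two runtime versions and fills in the three parameters common to both formats. `InstanceQueryUrl` and `Build` are hypothetical names, and the base URL in the usage note is a placeholder:

```csharp
using System;

public static class InstanceQueryUrl
{
    // Builds the "query all instances" URL. Per the formats above, Functions 1.0
    // and 2.0 differ only in the path prefix.
    public static string Build(string baseUrl, bool functionsV2,
        string taskHub, string connection, string systemKey)
    {
        string prefix = functionsV2
            ? "/runtime/webhooks/DurableTaskExtension"
            : "/admin/extensions/DurableTaskExtension";

        // Escape each value so that special characters in keys or names stay intact.
        return baseUrl.TrimEnd('/') + prefix + "/instances/"
            + "?taskHub=" + Uri.EscapeDataString(taskHub)
            + "&connection=" + Uri.EscapeDataString(connection)
            + "&code=" + Uri.EscapeDataString(systemKey);
    }
}
```

For example, `InstanceQueryUrl.Build("https://myapp.example.net", true, "ProdHub", "Storage", key)` would produce a URL with the `/runtime/webhooks/` prefix that can then be sent as a GET request.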
Thank you for the input @TsuyoshiUshio.
Do I understand correctly that the proposed "safe" solution is to make sure that no Activity is running in the system before deploying again? This is a bit hard to enforce. Let's say that I have an Orchestrator with a timer: could that timer trigger at any time during the deployment and break everything? For a high-volume scenario, it might be very difficult to make that work...
In the article, there is nothing about the status of the Orchestrator while upgrading. Can an Orchestrator be "Running" during the upgrade (no activity executing, but some activities remaining in the workflow)? If I change the HubName, as suggested, I am assuming that any Orchestrator still "running" will be lost forever (the durable framework will use different storage for its metadata)?
How long is the "long running process" that you are referring to? Depending on your answers to the questions above, if the Orchestrator cannot be "running" during the upgrade, then this approach is not well suited to any Orchestrator with more than a few seconds (at most minutes) of execution time; otherwise I would need to wait a few days for some orchestrations to complete.
Other suggestions:
Maybe adding something about Deployment Slot usage would be interesting. What happens if the staging slot runs concurrently with the prod slot?
Is there any way to disable durable functions in the slot, to prevent old code from being executed concurrently with newer code in the prod slot?
@cgillum considering that this is out of preview, I guess that some customers are using that in Production, what are common DevOps scenario you have heard for Durable Functions to prevent any issue (100% safe that everything is executed and nothing is lost)?
@TsuyoshiUshio can speak to that better than I can since he is working directly with some of these customers in Japan, but right now the main approach being used is the Azure Event Grid integration to track orchestration lifecycle across multiple task hubs (which is described above). We've also spoken to customers that have less aggressive requirements, and for them we're creating a REST API that can enumerate the list of all orchestrations in a task hub as a simpler (though less scalable) solution.
@TsuyoshiUshio is going to put together comprehensive walkthrough documentation which outlines some of the end-to-end mechanisms for implementing DevOps with Durable Functions, and it will also cover these versioning scenarios.
Sorry for the late reply. Yes, I'll try to do it, so your use case is very welcome.
@SimonLuckenuik In your case, we might need a new feature to stop accepting new requests. Let's keep discussing this on #184. :) If Durable Functions has no such feature, I'd be happy to contribute an implementation.
Actually, upgrading the app is fine as long as you don't change the orchestrator or activity function interfaces. However, if you change one of these, you need to make sure there are no in-flight instances, because the orchestrator replays according to the history in the storage table. If you change the orchestrator or activity function interfaces, the records in the storage table will not match the new version.
In short, for a safe deployment, we need to make sure
We could check whether the orchestrator / activity function interfaces have changed, but the pipeline might become complex. To achieve 1, we need to wait for the current in-flight instances to finish, and we also need to stop accepting new requests. If you want to stop an instance in the middle of its execution and your functions are idempotent throughout the orchestration, you might just stop it and replay it (there is no feature for these processes). This is my basic idea.
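A minimal sketch of the "wait until nothing is in flight" check a pipeline could run before swapping in a breaking version. `RuntimeStatus` is a local stand-in for the status values the extension reports, and `DeploymentGate` is a hypothetical name. In practice the input would come from the status query shown earlier in this thread:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in for the orchestration runtime status values (assumption:
// mirrors the names the extension reports; not the library's own type).
public enum RuntimeStatus
{
    Pending, Running, ContinuedAsNew, Completed, Failed, Terminated, Canceled
}

public static class DeploymentGate
{
    // An instance is in flight if it has started, or is about to start, and has not finished.
    public static bool IsInFlight(RuntimeStatus status)
    {
        return status == RuntimeStatus.Pending
            || status == RuntimeStatus.Running
            || status == RuntimeStatus.ContinuedAsNew;
    }

    // Safe only when no instance in the task hub is still in flight.
    public static bool IsSafeToDeploy(IEnumerable<RuntimeStatus> statuses)
    {
        return !statuses.Any(IsInFlight);
    }
}
```

This check alone is not enough: unless new requests are blocked first, a new instance can start between the check and the deployment.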
@SimonLuckenuik
This issue has been open for a while, and I just want to make sure that we understand what the ask here is.
It seems like the ask is to add more explicit/concrete instructions on how to implement side-by-side versioning, and to make our versioning docs a bit more clear that this is the most highly recommended scenario?
One thing to note here is that we need to convey the pros/cons of each approach (with more concrete exceptions). It's also worth noting that the introduction of entities changes the calculus of having separate task hubs for side-by-side deployments...
Bumping this topic - Could the RideSharing sample be used as inspiration for versioning discussions? It's a concrete example with potentially frequently-running orchestrations and entities containing application-critical state. I'm currently at a loss as to how to update an application with live orchestration and entity functions without blocking the client from submitting new orchestrations and then letting in-flight orchestrations run to completion, while maintaining a single representation of entities (i.e., not a side-by-side deployment).
I just went through this document: https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-versioning and the documentation is very light. All solutions sound like data is being lost or left pending, except for function naming.
What is the proper strategy for versioning? That seems to be a complex topic and the documentation is very light with no complex example. The mitigation strategies suggested are probably not applicable to most people: "Do nothing" and "Stop all in-flight instances". The goal of creating a stateful workflow is to have it running for a long time, so probably doing nothing and stopping all instances is not appropriate.
Side-by-side deployments:
Could you please elaborate on versioning with specific examples and tutorials / samples?
This is a very important topic, and I expect that figuring out what is happening in case of an improper versioning will be difficult to track / detect.
Thanks! Simon