Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

DevOps guide for versioning #184

Closed cgillum closed 3 years ago

cgillum commented 6 years ago

This work item tracks creating a DevOps guide for versioning orchestrations. It should complement the existing versioning documentation by providing guidance on how to build automation using tools such as slots, deployments, app settings, etc.

TsuyoshiUshio commented 6 years ago

Sure. I'll take this task. :)

TsuyoshiUshio commented 6 years ago

Hi @cgillum, I've been considering possible CI/CD pipeline ideas. The concept is slightly complex, so I wrote a blog post with some diagrams to start the discussion and share the idea. Could you give me some comments on it? In particular, is it possible to use the first strategy when there are no breaking changes?

https://medium.com/@tsuyoshiushio/durable-functions-blue-green-deployment-strategies-ed25509ecd60

TsuyoshiUshio commented 6 years ago

I'm starting to implement these ideas today.

TsuyoshiUshio commented 6 years ago

@cgillum I created a long-running activity (30 min) and tried the first deployment strategy from my blog (deploy with a non-breaking change): I updated the code without a breaking change and deployed the new version. The deployment hit a file lock, so I restarted the Function App, but it still didn't work well, because the activity was in flight and the new DLL was reloaded after the orchestrator had already sent a message to the activity. This means that, as your documentation says, we can only use the breaking-change strategy for deployment (changing the task hub or storage account, deploying to a new slot, and keeping the current function app).

If I use that strategy, I'd like to know when all of the current tasks have finished. What is the simplest way to know that? Search for ExecutionCompleted in the storage table? Or record the instanceId when the orchestration client is called and then ask the client for the status?

C:\Program Files\dotnet\sdk\2.1.100\Sdks\Microsoft.NET.Sdk.Publish\build\netstandard1.0\PublishTargets\Microsoft.NET.Sdk.Publish.MSDeploy.targets(139,5): error : Web deployment task failed. (Web Deploy cannot modify the file 'DurableDeployment.dll' on the destination because it is locked by an external process.  In order to allow the publish operation to succeed, you may need to either restart your application to release the lock, or use the AppOffline rule handler for .Net applications on your next publish attempt.  Learn more at: http://go.microsoft.com/fwlink/?LinkId=221672#ERROR_FILE_IN_USE.) [C:\Users\tsushi\source\repos\DurableDeployment\DurableDeployment\DurableDeployment.csproj]
  Publish failed to deploy.
cgillum commented 6 years ago

Ah, it's interesting that the deployment fails. I don't think that is specific to Durable Functions though. Wouldn't you have similar errors for any function app? What do customers normally do for in-place updates?

To bypass this issue for the bug-fix scenario, an alternative strategy would be to stop the function app, do the deployment, and then start it again. There should be no data loss in this case. However, the risk for the customer is that the long-running activity function will execute again from the beginning. Because of this, it is important that the activity function code is idempotent (this is true for other trigger types as well, including Azure Queue, Event Hubs, and Service Bus triggers).
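To make the idempotency point concrete, here is a minimal sketch assuming the Durable Functions 1.x C# activity trigger; `MarkerStore` and `PaymentGateway` are hypothetical helpers standing in for whatever store and side effect the customer actually uses:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class IdempotentActivity
{
    // If the host restarts mid-execution, this activity may be replayed.
    // Checking a durable completion marker first makes the replay a no-op.
    [FunctionName("ChargeCustomer")]
    public static async Task Run([ActivityTrigger] string orderId)
    {
        if (await MarkerStore.ExistsAsync(orderId))   // hypothetical helper
        {
            return; // the work was already done by an earlier attempt
        }

        await PaymentGateway.ChargeAsync(orderId);    // hypothetical side effect

        await MarkerStore.CreateAsync(orderId);       // record completion for future replays
    }
}
```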

To answer your question about the breaking change strategy, the simplest way to know if all in-flight activity functions have completed is to check the instance history. The best way to do this is using the updated instance query APIs that Kanio created (code or webhook).
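For reference, a minimal sketch of checking a single instance's status from code, assuming the 1.x `DurableOrchestrationClient` API; the function name and route are illustrative, and a release pipeline could poll an endpoint like this before swapping slots:

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class InstanceStateCheck
{
    [FunctionName("GetInstanceState")]
    public static async Task<HttpResponseMessage> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "instances/{instanceId}")] HttpRequestMessage req,
        string instanceId,
        [OrchestrationClient] DurableOrchestrationClient client)
    {
        // Returns null if the instance is unknown to the task hub.
        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);

        bool stillRunning =
            status != null &&
            (status.RuntimeStatus == OrchestrationRuntimeStatus.Running ||
             status.RuntimeStatus == OrchestrationRuntimeStatus.Pending ||
             status.RuntimeStatus == OrchestrationRuntimeStatus.ContinuedAsNew);

        return req.CreateResponse(HttpStatusCode.OK, new { instanceId, stillRunning });
    }
}
```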

TsuyoshiUshio commented 6 years ago

Thank you @cgillum, the lock problem is not a big deal; I just wanted to share what happened. I deployed from VS, and other deployment strategies restart the function app automatically. As for the instance query API, I understand it is the simplest approach, but how can I get all of the instance IDs? Is there a query for that? If not, I might be able to contribute one.

TsuyoshiUshio commented 6 years ago

I'm now discussing this with @gled4er. There is no API for fetching all instance IDs. We have two options:

  1. Contribute a feature that fetches all instance IDs, or all instances that are still in flight.
  2. Use the Application Insights API.

If option 1 sounds good, I'd like to contribute it.

gled4er commented 6 years ago

Hello @TsuyoshiUshio ,

I think both options are good. I am a bit worried about the first one: currently you would need to read the whole History table or Instances table, since the key is the instance ID, and then filter in memory, which can be a very slow operation if you have many executions of the Durable Function. That is why I suggested working with the Application Insights Query REST API. The way we currently store the history events and instance information is not well suited for a "get all instance IDs" query.
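As a rough illustration of the second option, here is a minimal sketch of calling the Application Insights Query REST API from C#. The app id, API key, and the Kusto query itself are placeholders; a real query would filter the Durable Functions tracking telemetry for non-completed instances:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class AppInsightsQuerySample
{
    static async Task Main()
    {
        const string appId  = "<application-insights-app-id>";   // placeholder
        const string apiKey = "<application-insights-api-key>";  // placeholder
        const string query  = "traces | take 10";                 // placeholder Kusto query

        using (var http = new HttpClient())
        {
            // The Query REST API authenticates custom callers with the x-api-key header.
            http.DefaultRequestHeaders.Add("x-api-key", apiKey);

            string url = $"https://api.applicationinsights.io/v1/apps/{appId}/query" +
                         $"?query={Uri.EscapeDataString(query)}";

            string json = await http.GetStringAsync(url);
            Console.WriteLine(json); // result rows come back as JSON tables
        }
    }
}
```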

Let me know what you think.

Thank you!

TsuyoshiUshio commented 6 years ago

Hi @gled4er, that totally makes sense; it might have to read all of the table data. If we go with the first strategy, we might need to create some kind of index table that includes the instance ID and status, and then either update the record when an instance starts/finishes, or create a cron-based batch function that replays the table storage and builds the index table. Hmm... any comments, @cgillum?

gled4er commented 6 years ago

Hello @TsuyoshiUshio ,

We already created a similar Instances table, but the problem is that its partition key is the instance ID. I think what you need is another table that uses the runtime status as the partition key and the instance ID as the row key. Then you can query the instances with Running, Pending, and other non-completed statuses. If the purpose of this table is to help you decide whether there are any non-completed instances, we could even introduce an aggregated status for non-completed executions, so a single partition query tells you whether any instances are currently being processed. The disadvantage of this approach is that you end up with multiple rows for the same execution.
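A hypothetical entity for the first design described above (partition key = runtime status, row key = instance ID); this table does not exist in the extension and is only an illustration of the shape being discussed:

```csharp
using Microsoft.WindowsAzure.Storage.Table;

// One row per (status, instance). A single partition query such as
// PartitionKey eq "Running" answers "are there any non-completed instances?".
public class InstanceStatusIndexEntity : TableEntity
{
    public InstanceStatusIndexEntity() { }

    public InstanceStatusIndexEntity(string runtimeStatus, string instanceId)
        : base(runtimeStatus, instanceId) { }

    // Optional convenience column for tooling that reads rows back.
    public string OrchestratorName { get; set; }
}
```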

Another option is to have a table containing only instances whose runtime status is not completed, so you can just check whether this table has any entries. This assumes we clean up entries for instances once they complete.

The main thing I am worried about with both of these approaches is that we would need to update these tables on every execution status change, so this would affect the performance of Durable Functions for a use case that does not happen very often. It is a trade-off. That is why I think the App Insights Query API will be better suited for this use case, if it works as we expect.

I am looking forward to hearing @cgillum's opinion.

Thank you!

cgillum commented 6 years ago

I worry about trying to maintain multiple Azure Storage tables. There will be a non-trivial amount of performance overhead, and it may be difficult to ensure that all tables remain consistent. For that reason, I would like to stay away from this option. Interestingly, if we were using a provider like CosmosDB, we would have an easy solution, since CosmosDB supports multiple indexes. :)

Using Application Insights is generally what I'm recommending, but there is a concern in my mind about data retention policies. For example, what happens if an orchestration remains idle for more than 30 days? Will it still show up in the query somehow? We would need to think through this carefully.

Another option is to build the features that would allow a customer to implement this themselves. For example, customers can control the instance IDs themselves, and can also learn the IDs of orchestration instances that they start. A customer could theoretically insert these into their own database and create their own tools for tracking status. This reminds me of a feature ask from @yu_ka1984 for notifications when an orchestration completes. Something like this might be helpful as well for building this feature.
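A minimal sketch of this customer-side approach, assuming the 1.x `DurableOrchestrationClient` API; `SaveInstanceIdAsync` is a hypothetical helper standing in for the customer's own tracking database:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class TrackedStarter
{
    [FunctionName("StartTrackedOrchestration")]
    public static async Task<HttpResponseMessage> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req,
        [OrchestrationClient] DurableOrchestrationClient client)
    {
        // Caller-controlled instance ID (could also be a business key such as an order number).
        string instanceId = Guid.NewGuid().ToString("N");

        // Record the ID in the customer's own tracking store before starting,
        // so a deployment pipeline can later ask which instances are in flight.
        await SaveInstanceIdAsync(instanceId);

        await client.StartNewAsync("MyOrchestrator", instanceId, null);
        return client.CreateCheckStatusResponse(req, instanceId);
    }

    // Hypothetical helper; in practice this would write to the customer's own database.
    private static Task SaveInstanceIdAsync(string instanceId) => Task.CompletedTask;
}
```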

gled4er commented 6 years ago

Hello @cgillum ,

Thank you for the great insights!

I was thinking the same about Cosmos DB. We now have an increasing number of reasons to go after it! Exciting!

In terms of Application Insights, the default retention policy for raw and aggregated data is 90 days for both the Basic and Enterprise tiers, as shown here. If this is not enough, developers can export their data to blobs and auto-index it with Azure Search for an experience similar to the Application Insights Query API.

The feature request for orchestration completion notifications sounds very interesting, but it also means that we need to keep a mapping between started and completed executions in an external database.

To me it looks like the fastest way will be to use Application Insights.

Thank you!

TsuyoshiUshio commented 6 years ago

Actually, @yu_ka1984 is involved in this project. :) I'll share the details with you elsewhere. :) Anyway, thank you for the ideas, they are very helpful. Thank you @cgillum. We'll implement the solution next week. Detecting orchestration completion is very important for CD. I also want a solution that customers can reuse, even if it takes me some extra time. I'll do some experiments to evaluate both ideas. :)

TsuyoshiUshio commented 6 years ago

I had a look at the Durable code. I can imagine the App Insights implementation, so I'll start with the second option. We currently have no way to know that a function has completed, except by reading the whole storage table data.

A simple idea for the implementation:
If the customer sets a specific app setting, the FunctionCompleted method with FunctionType = Orchestrator emits an event (e.g. to Event Grid or a Storage Queue). Then the customer can create a custom binding or just use those triggers. This is one idea. However, your code is very clean and the responsibilities are well designed, and this change would add another responsibility to it. If there is any other way to find out that a function has completed, I'm open to it.

https://github.com/Azure/azure-functions-durable-extension/blob/20b0dc22c7ac88c881f96343d044efc9373872fb/src/WebJobs.Extensions.DurableTask/Listener/TaskOrchestrationShim.cs#L96
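A rough sketch of the idea (not the extension's actual implementation): when an orchestration completes, post a lifecycle event to a custom Event Grid topic so a CD pipeline or custom trigger can react. The app setting names here are invented:

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

public static class OrchestrationLifecycleNotifier
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task NotifyCompletedAsync(string hubName, string instanceId)
    {
        // Hypothetical app settings holding a custom Event Grid topic endpoint and key.
        string topicEndpoint = Environment.GetEnvironmentVariable("DurableTask_LifecycleTopicEndpoint");
        string topicKey = Environment.GetEnvironmentVariable("DurableTask_LifecycleTopicKey");

        // Event Grid custom topics accept an array of events with these standard fields.
        var events = new[]
        {
            new
            {
                id = Guid.NewGuid().ToString(),
                eventType = "orchestratorEvent",
                subject = $"durable/{hubName}/{instanceId}",
                eventTime = DateTime.UtcNow.ToString("o"),
                data = new { hubName, instanceId, runtimeStatus = "Completed" },
                dataVersion = "1.0"
            }
        };

        var request = new HttpRequestMessage(HttpMethod.Post, topicEndpoint)
        {
            Content = new StringContent(JsonConvert.SerializeObject(events), Encoding.UTF8, "application/json")
        };
        request.Headers.Add("aeg-sas-key", topicKey);

        using (HttpResponseMessage response = await Http.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
        }
    }
}
```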

cgillum commented 6 years ago

I like this suggestion about posting to Event Grid. I think it makes sense for our extension to do this in the way you are suggesting. I went ahead and created a GitHub issue to track it: https://github.com/Azure/azure-functions-durable-extension/issues/216

TsuyoshiUshio commented 6 years ago

Hi @cgillum, a few days ago I posted this blog to get feedback from users. When I talk with my customers who want to use Durable Functions, a lot of them don't realize this fact; however, when I explain why it is necessary, they say they really want it. So far the detailed blog post hasn't received much feedback.

https://medium.com/@tsuyoshiushio/safe-blue-green-deployment-with-durable-functions-905a1cda0450

I'm guessing it's something like this, aside from my poor English skills, lol.

For the official documentation, based on this feedback, I recommend hiding the complexity of the state management.

For example, we could have a VSTS task for Durable Functions, plus the state-management backend Azure Functions API that we are developing at the hackfest. If we automate the deployment of the backend API and publish it as an Azure-Samples project, customers might not feel much pain.

What do you say?

cgillum commented 6 years ago

Is the suggestion to create a VSTS task which can be shared with customers, and then provide some code (maybe a serverless backend using Functions) which can also be shared and used to automate the deployment process?

TsuyoshiUshio commented 6 years ago

Yes. We're working on it with a customer, so we can share it. My question is what the proper way is to share it in the official Durable Functions documentation. It might not be an official task and backend sample, and in that case it wouldn't fit as a "best practice" in the official documentation. We have several possible strategies:

  1. Just explain the lifecycle events and how to handle them, and share the idea of how to deploy safely. This is a good abstraction, but customers might need some effort to achieve it.
  2. Move the task and backend service, with the deployment script, to Azure-Samples; the official documentation refers to it and explains how to implement it along with the background architecture. (This is what I'm thinking now.)
  3. Make the task / backend official by working with the VSTS team; however, this might take time and we don't know if they have the resources for it.

This is just my basic idea and I'm open. What do you think? Maybe a Skype meeting at some point could reduce our effort. :)

TsuyoshiUshio commented 6 years ago

Hi @cgillum, since we've learned a lot about the DevOps story, I'd like to start writing the docs / samples.

I'll implement pipelines for both C# (V1 and V2) and JavaScript (V2) with VSTS.

One thing I'm considering is how to block new work. When a customer wants to deploy the durable app, they need to make sure that there are no running processes on the target task hub, so they might want to stop accepting new requests.

Should this be controlled on the customer side, or should we implement a blocking state in the orchestration client / REST API?

E.g., if it is in a closed state and the customer sends a request to start an orchestrator, it throws a special exception.

cgillum commented 6 years ago

Great! One way to stop processing is to disable the function app. However, this will also prevent in-flight instances from completing, so it's only useful for "non-breaking" changes or emergency hotfixes.

For everything else, I assume what we want is to wait for existing instances to complete. This means we have to stop accepting new requests. One technique is to disable all the "trigger" functions. Unfortunately there is no easy way to do this generally. For Functions v2, it can be done using app settings. I don't believe there is a general solution using v1, so we will have to think of something else. A more robust solution might be for us to create a Durable Functions-specific setting that prevents creating new instances.
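A sketch of the "stop accepting new instances" idea, implemented on the customer side rather than in the extension: the HTTP starter refuses new orchestrations while a hypothetical draining-flag app setting is turned on by the release pipeline. This assumes the 1.x C# API; the setting name and function names are invented:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class DrainableStarter
{
    [FunctionName("HttpStart")]
    public static async Task<HttpResponseMessage> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req,
        [OrchestrationClient] DurableOrchestrationClient client)
    {
        // Hypothetical app setting flipped by the release pipeline before a swap.
        bool draining = string.Equals(
            Environment.GetEnvironmentVariable("ORCHESTRATION_STARTS_DISABLED"),
            "true", StringComparison.OrdinalIgnoreCase);

        if (draining)
        {
            // New work is rejected; in-flight instances keep running to completion.
            return req.CreateResponse(HttpStatusCode.ServiceUnavailable,
                "Deployment in progress; new orchestrations are temporarily disabled.");
        }

        string instanceId = await client.StartNewAsync("MyOrchestrator", null);
        return client.CreateCheckStatusResponse(req, instanceId);
    }
}
```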

Can VSTS tasks do things like update a blob in Azure Storage?

TsuyoshiUshio commented 6 years ago

VSTS can upload a blob to Azure Storage. However, it can only upload/update the blob, not remove it, at least with the official/marketplace tasks. If required, I can develop the task. :)

cgillum commented 6 years ago

Actually, before we come up with the design, maybe we need to better understand the scenarios. There are a few questions we need to think about:

  1. What if there are long-running orchestrations which need to run for days/weeks/months? It probably doesn't make sense to wait for those to drain.
  2. What do we do about eternal orchestrations? By design, these will never complete.
  3. What if there are abandoned orchestrations? For example, maybe it's waiting on an event which will never be sent.

How should these scenarios be handled when doing upgrades?

TsuyoshiUshio commented 6 years ago

IMO, in cases 2 and 3 we can stop these instances. Case 2 is easy; however, for case 3 it might be difficult to distinguish whether an instance is abandoned or active. Maybe we can identify that via a timeout.

Case 1 is also difficult; there we should execute a recovery plan. Ideally we could migrate from the old version to the new version by remembering the start request that was sent to the OrchestrationClient: if an instance is suspended, the orchestrator sends the same request to the new version. However, this strategy only works if the whole orchestration, including its activity functions, is idempotent. Long-running processes (days/weeks/months) may need to implement checkpoints themselves.

What do you think?

SimonLuckenuik commented 6 years ago

Linking to this issue, since raising similar concerns : https://github.com/Azure/azure-functions-durable-extension/issues/320

SimonLuckenuik commented 6 years ago

For brainstorming purposes, a couple of questions/comments that could help find the proper approach:

A) What is the proposed approach for the deployment/upgrade scenario while functions are currently executing in Azure Functions (without Durable)?
B) What happens when a scale-down occurs and a function is currently executing on the node being stopped?
C) What happens when an Azure Function app is stopped while a function is executing?
D) If stopping the Function App is required, it means that Durable Functions should be kept in a separate Function App from any HTTP API Function App to prevent any "sync trigger" issue (service not reachable because we are doing an upgrade of the Durable Function).

[Personal note] While I love this technology, from the external perspective of someone who invested time and effort to integrate it into a product, I am worried/surprised that these Durable Functions lifecycle concerns were not addressed from day one (or at least during the initial preview).

TsuyoshiUshio commented 6 years ago

Thank you for your help! IMO:

A) If the function is idempotent, just replay it. FYI: https://medium.com/@tsuyoshiushio/serverless-idempotency-with-azure-functions-23ed5da9d428
B) According to the scaling algorithm, this might not happen; scaling is scale-out intensive.
C) Same as A.
D) If you just stop the app, in the case of Durable Functions the messages stay in the queues and the storage table keeps the state, so processing is automatically replayed.

What do you think, @cgillum?

cgillum commented 6 years ago

@SimonLuckenuik To put my perspective on this:

In all cases where function execution gets interrupted, the user needs to rely on retry mechanics to handle it gracefully. For HTTP, that means retrying HTTP 503 errors that occur when a function app gets restarted. For non-HTTP (queues, event hub, durable, etc.), retry happens automatically. In all cases, idempotency in the function code is required. Unless you are leveraging a messaging platform with at-most-once or exactly-once (if such a thing exists) messaging guarantees, then you must accept that duplicate execution is possible.

From a versioning perspective, regular functions also have to be concerned about the impacts of code changes on existing data. For example, what if your function code changes the expected format of a queue message? How will existing systems (which might be enqueuing the message or dequeuing it) handle these kinds of breaking changes? From that perspective, Durable Functions is no different - orchestration instances are just a different form of persistent data. The good thing about durable is that it encourages developers to think about this more carefully because there is an explicit contract between different functions.

Responding to the personal note: I generally disagree with the sentiment that Durable Functions lifecycle concerns were not addressed from day one. The versioning guidance has certainly been there since day one. I think what's missing is 1) samples that people can reference as they think about their specific requirements and 2) tools/APIs to simplify implementing the guidance (things like Event Grid integration and APIs to enumerate all instances are examples of these). This work item is primarily about 1) and will help identify areas where we should invest more in 2).

jeffhollan commented 6 years ago

Just jumping in to confirm. @SimonLuckenuik a lot of these concerns apply far more broadly than just Durable and are / should be considerations for any function regardless of type. In order to have "zero downtime" deployments with Azure Functions you'd need to do a blue/green deployment with something like Traffic Manager or proxies if HTTP, or potentially a phased / competing-consumer approach if non-HTTP. In short, nothing here is unique to Durable, and true zero downtime requires some manual coordination (downtime between deployments for each instance should be minimal, as Chris explained).

In many ways I think Durable actually helps with a lot of these concerns: if an activity was "in flight" during a restart of an instance, the durable framework can help ensure an additional attempt is created to retry the message. Though it is worth noting, as you did, that the function app is the unit of deployment - all pieces are deployed as a whole - so be conscious of the other functions included in the app with your durable function.

In terms of versioning (this specific issue), I think the main questions are, as @cgillum called out, those that focus on "if I have a durable orchestration instance that needs to outlast this version of the durable function, what's the best way to manage it?" I think the questions on zero-downtime deployment would best continue as a separate issue on the azure/azure-functions repo (though we are aware of them and have a few things on the backlog to help address this in the future).

Separately, the problem I have with the wait-for-the-app-to-drain approach is that it makes assumptions about the type of workload (short-running, with enough gaps between jobs that you can find a "slow spot" to hurry and deploy during). I think the best chance for the least disruption, and for supporting long-running orchestrations, is deploying a new app, routing all net-new requests to the new app, and allowing existing instances to complete on the old app version (which may exist indefinitely). That means some of the things I mentioned at the top potentially apply as well (if HTTP, you need a way to route to the new version without making clients update URLs; if triggering from queues, you need to phase from app v1 to app v2; etc.).

TsuyoshiUshio commented 6 years ago

Hi @cgillum, as for samples, I'm developing DevOps pipelines for V1, V2, and Node durable functions. However, I have some issues to discuss.

  1. Deployment method

I recommend using a Run-From-Zip deployment pipeline. However, there is no official task for that, so I had to use a third-party task. Have you discussed this with the VSTS team?

  2. Checking the state

Currently, if we want to check the status of an orchestration, we need to implement an API or some logic for the pipeline. I have two ideas for doing this: 1. implement it via Azure Functions, or 2. create a new VSTS task. Which would you prefer? In the case of a VSTS task, I'd like to implement it quickly myself and hopefully make it official, or have the VSTS team implement a similar (or improved) solution.

If we use Event Grid publishing, we need an Azure Functions (backend) sample or a query-all-instances API. Which would you prefer?

SimonLuckenuik commented 6 years ago

@jeffhollan, thank you for the clarifications. Concerning zero-downtime deployment, I started a new thread here: https://github.com/Azure/Azure-Functions/issues/862

ConnorMcMahon commented 3 years ago

We have documentation for this scenario now.