Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Querying the Storage Table by instance id from within the DurableOrchestrationClient #2152

Closed jonathanharr closed 2 years ago

jonathanharr commented 2 years ago

Is your feature request related to a problem? Please describe. My problem appeared after I had developed a durable function. At the beginning, it calls an activity function that retrieves an OAuth2 token using a refresh_token from the database. Once it is finished with the token, it stores the refresh_token from the retrieved OAuth2 token in a CosmosDB container, replacing the old one with the new one.

  1. Retrieve OAuth2 token from previous refresh_token (ActivityFunction that returns the token to the DurableOrchestrationFunction)
  2. Do stuff with it (Various ActivityFunctions that need the token)
  3. Once finished, replace the previous refresh_token with the newly generated one (an ActivityFunction that takes the refresh_token as input and sends it to CosmosDB)

This worked fine during local development. Once deployed to Azure, however, it could give some unexpected results: sometimes it worked, sometimes it didn't.

Most notable was when the orchestration instance took too long to do its work during the second step. What would then happen was that the orchestration replayed itself and renewed the token before step 2 was finished. This would of course lead to a failure, since the renewal would use the old refresh_token that is no longer valid, and step 2 would also fail when the token suddenly was null.

A lot of things can go wrong in this scenario, since the function itself doesn't hold state; it only reads state from the table storage. I will describe the solution I'd like further down, but one of the workarounds I figured out was to query the Azure Storage Table with the orchestrationId (which is the PartitionKey shared by all the rows related to the orchestration) and then figure out whether my orchestration had already generated a new OAuth2 token or not.

Basically an if else scenario (pseudocode):

if (OAuth2TokenIsGeneratedByOrchestrationClient(orchestrationId))
  // query table for the token and return it from the activity function
else 
  // query Cosmos DB for the previous refresh_token, generate a new OAuth2 token and return it from the activity function
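
For concreteness, here's a rough sketch of what that check could look like with the Azure.Data.Tables SDK. The task hub name, the history table name ("{TaskHubName}History"), and the column names ("EventType", "Name") are my assumptions about how the default Azure Storage backend lays out its data, not a supported contract:

    using System;
    using System.Linq;
    using Azure.Data.Tables;

    static bool OAuth2TokenIsGeneratedByOrchestrationClient(string orchestrationId)
    {
        // "TestHubName" is a placeholder task hub; the history table is
        // conventionally named "{TaskHubName}History" in the storage backend.
        var historyTable = new TableClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"),
            "TestHubNameHistory");

        // Every history row of an orchestration shares the instance id
        // as its PartitionKey.
        var rows = historyTable
            .Query<TableEntity>(filter: $"PartitionKey eq '{orchestrationId}'")
            .ToList();

        // Was the "TokenRetriever" activity scheduled, and did a task complete?
        return rows.Any(r => r.GetString("EventType") == "TaskScheduled"
                          && r.GetString("Name") == "TokenRetriever")
            && rows.Any(r => r.GetString("EventType") == "TaskCompleted");
    }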

Are there any existing GitHub discussions or issues filed that help give some context to this proposal? I've seen some suggestions of querying the durable instance by its orchestration name for a status update, which is quite similar to what I'm proposing, except I'm more interested in a larger perspective, namely the entire history.

Allow querying durable instances by orchestration name #1296

Describe the solution you'd like Since everything is already stored inside Azure Storage, why not add functionality to directly retrieve what has or hasn't been done?

Since the durable function already depends on Azure Storage, it would make sense to give it the ability to actually query what has happened so far.

Now, this can be done in several ways. In my specific use case I basically want to recall an already finished activity based on the durable orchestration instanceId. There should of course be some more general way to do it that suits more needs (like, for instance, #1296, Allow querying durable instances by orchestration name). But in my case it could have been something as simple as:

        // Hypothetical API: fetch this activity's previous result from the
        // instance's History, if one exists, instead of re-running it.
        Token token = await context.RecallActivityAsync<Token>("TokenRetriever", context.InstanceId);
        if (token == null)
        {
            token = await context.CallActivityAsync<Token>("TokenRetriever", null);
        }

Now, don't take my snippet here as the definitive proposal for how to do it, but something along these lines would make sense. That way it wouldn't rerun the ActivityFunction; it would just fetch the previous result from storage. (However, it might just make sense to have some sort of behavior where it does this by default? I don't know!)
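
For context, the closest existing mechanism I'm aware of lives on the client side rather than inside the orchestrator: IDurableOrchestrationClient.GetStatusAsync can return an instance's history when called with showHistory: true. A rough sketch of a client-triggered function doing the check (the function name and query parameter are placeholders):

    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Http;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;
    using Microsoft.Azure.WebJobs.Extensions.Http;

    public static class InspectHistory
    {
        [FunctionName("InspectHistory")]
        public static async Task<bool> Run(
            [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req,
            [DurableClient] IDurableOrchestrationClient client)
        {
            string instanceId = req.Query["instanceId"];

            // showHistory: true also populates status.History with the
            // instance's history events.
            DurableOrchestrationStatus status =
                await client.GetStatusAsync(instanceId, showHistory: true);

            // Field names as exposed by the status query API, to my understanding.
            return status.History
                .Any(e => (string)e["EventType"] == "TaskCompleted");
        }
    }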

Describe alternatives you've considered I tried some other methods, like storing the complete OAuth2 token entirely and retrieving it for use. However, I wanted to avoid storing the access_token, as I came to understand that this is a very good way to limit the risk of someone else using the token. (Of course, someone else could just use the refresh_token as well; however, they would need a lot more data to renew the token than they would have needed to just use the access_token in my scenario.)

Additional context This use case on its own may not be a compelling reason to do this, but I figure there can be many scenarios where a replaying orchestration might just redo previous steps, since the orchestration itself doesn't hold state; it relies on the storage for that.

In my case this was a project I was doing to learn more about Azure and Durable Functions. I was using this with my Philips Hue smart lamps. The source code can be seen here. Please note that this project is just a bunch of experiments with a lot of stuff, most of it being functions, but the link to that subproject is a finished "POC" of what I am doing right now.

Also, I did try checking with the IsReplaying property, and I couldn't get it to work at all.

davidmrdavid commented 2 years ago

Hi @DieselWeazel,

Thanks for reaching out. I'm not exactly sure I'm following your description, so I'll need to ask a few clarifying questions.

In particular, Orchestrator Functions are supposed to be deterministic, which in turn makes them replay-safe. Generally, we do this by "pushing" all non-determinism inside Activity Functions, whose results are stored in the orchestrator's internal History and then retrieved during replay.
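
To illustrate with a minimal sketch (the function names, the Token type, and the inputs below are placeholders, not your code):

    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    // Minimal stand-in for the app's own token model.
    public class Token
    {
        public string AccessToken { get; set; }
        public string RefreshToken { get; set; }
    }

    public static class TokenOrchestration
    {
        [FunctionName("TokenOrchestrator")]
        public static async Task RunOrchestrator(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            // All non-determinism (the call to the token endpoint) lives in
            // the Activity. On replay this await does not re-run the Activity;
            // its result is re-read from the orchestration's History.
            Token token = await context.CallActivityAsync<Token>("RefreshToken", null);

            // Downstream activities see the same token on every replay.
            await context.CallActivityAsync("DoStuffWithToken", token);
        }
    }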

With that in mind, I'm confused as to how the token is getting re-generated due to replay. Are you generating the token inside the orchestrator itself? From your code description, I thought you weren't, but the effects you're observing suggest that you are. Please let me know, thanks!

jonathanharr commented 2 years ago

Hey @davidmrdavid

Well, then I am at a bit of a loss as to how I encountered the problem... I do understand what you're saying; I'm just confused as to how the problem originated.

The Activity Function that renewed the previous refresh_token returned it to the orchestrator. In a simple assignment like this:

Token token = await context.CallActivityAsync<Token>("TokenRetriever", context.InstanceId);

The code worked fine locally, but when I pushed it to Azure it was really slow. The orchestration wasn't finished, and after a while the first Activity Function was called again to renew the token.

But from reading your response, I'm now thinking that maybe this was an error on my end. Since it should work, I'm beginning to wonder if I accidentally (or just out of impatience with the slow process) triggered another request, which then used the previous refresh_token while there actually was a new one from the other Activity Function.

From what I understand then, if an Orchestrator's internal history contains a TaskCompleted EventType from an Activity Function, it shall not and will not replay this Activity Function during a replay? Even if, say, Azure moves it to another VM, or it just takes too long, it will query its own history to see that it doesn't have to redo what is already done? I'm wondering if I should do some tests, maybe... I apologize as well, because I'm very open to the possibility that this was an error on my end and that I just misunderstood how the Orchestrator Function is replayed.

davidmrdavid commented 2 years ago

Hi @DieselWeazel,

Thanks for your response, and no worries, the execution guarantees of Durable Functions aren't immediately obvious, so these are all good questions!

Taking a step back, if the token-retrieval occurs inside an Activity, then at least we know that no orchestration code constraints are being violated. In other words, I think we're safe to assume your orchestration is replay-safe.

Regarding your question about TaskCompleted events, you're correct. If we have a record that the activity completed, then we should not be scheduling it again. For clarity: if the orchestration History has a complete record that your Activity was scheduled and that it then completed, then we will not re-schedule that Activity.

However, there's a caveat in the details. Durable Functions has an at-least-once message delivery guarantee for Activities. This means that if the Activity takes too long to complete and we assume that its work was somehow lost, then we will re-schedule that Activity. This is why Activity Functions need to be idempotent, so that they can be retried until a result is received. Once a result is received and durably stored in our History, then we will stop scheduling it.
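
For example, here's a rough sketch of what an idempotent token-refresh Activity could look like. The store and the refresh helper are hypothetical placeholders (backed by, say, Cosmos DB), not Durable Functions APIs; the point is just that a duplicate execution hands back the already-persisted result instead of refreshing a second time:

    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public class Token
    {
        public string AccessToken { get; set; }
        public string RefreshToken { get; set; }
    }

    // Hypothetical persistence abstraction; wire it to Cosmos DB or similar.
    public interface ITokenStore
    {
        Task<Token> GetTokenAsync(string instanceId);
        Task<string> GetRefreshTokenAsync();
        Task SaveTokenAsync(string instanceId, Token token);
    }

    public static class TokenRetrieverActivity
    {
        private static readonly ITokenStore Store = null; // placeholder wiring

        [FunctionName("TokenRetriever")]
        public static async Task<Token> Run([ActivityTrigger] string instanceId)
        {
            // If an earlier execution already refreshed the token for this
            // orchestration instance, return the stored result unchanged.
            Token existing = await Store.GetTokenAsync(instanceId);
            if (existing != null)
            {
                return existing;
            }

            // Otherwise refresh, and persist the new refresh_token *before*
            // returning, so that a duplicate execution sees the stored result.
            Token fresh = await RefreshWithOAuthServerAsync(
                await Store.GetRefreshTokenAsync()); // hypothetical helper
            await Store.SaveTokenAsync(instanceId, fresh);
            return fresh;
        }

        private static Task<Token> RefreshWithOAuthServerAsync(string refreshToken)
            => throw new System.NotImplementedException(); // sketch only
    }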

Could it be that this is what's happening with your orchestrator? Have you seen any evidence that your Activity is executing multiple times? Thanks!

jonathanharr commented 2 years ago

Hey @davidmrdavid

No, I'd like to thank you for taking the time with this, since I'm a bit of a newbie here and it could just be an error on my end :)

Anyhow, regarding what you said about Activities needing to be idempotent: does this mean that the retrieved OAuth2 token should be stored somewhere right away? Otherwise it might have renewed the token but not stored it, since that was done in step 3.

This isn't actually that big of a change; it would just be very important for the Activity Function to make sure that, if it has actually refreshed the OAuth2 token, it also stores the new refresh_token. Otherwise, if the Durable Orchestrator reschedules the Activity Function, it might end up trying to reauthorize the token and fail, since it would be using a faulty (or rather, old) refresh_token.

However... since all the Activity Function did was return the token, if it now also has to store it and doesn't manage to do so in time, it could just as well fail...

I'm a bit curious here: what creates the assumption that the work is somehow lost? Is there a way to write code that takes this into consideration? It seems a bit unstable if that is the case. I'm guessing there's a reason behind this, and I'm curious to know more about it.

If I'm understanding you correctly, this means an activity function should always be able to produce the same value? What if that isn't desirable? Perhaps you're moving data to some other DB and you need to keep the datetime stamp correct for legal reasons, or, well, anything!

EDIT: The date thing is quite easy to work around, of course... But in my case, the more I think about this, the more it strikes me that relying on the activity function to refresh the OAuth2 token is a bit of a risk no matter what. Or how would one get around this dilemma, since it might end up not storing the new refresh_token in time and all?

davidmrdavid commented 2 years ago

@DieselWeazel:

Hmm, there are a lot of moving pieces here :-) .

I'm unsure if your Activity has to store the token immediately after obtaining it, but that's simply because I'm not sure how your token-refresh scenario works. What I can say is the following: if we schedule your Activity multiple times (due to the at-least-once delivery guarantee), we'll do it with the same input every time within a given orchestration replay. So if the TokenRefresh Activity doesn't tolerate this, because the refresh token may by then be out of date, then that would be a problem.

Regarding your question on "what creates the assumption that the work was lost?", well, that's a long, long answer :-) . In general, since your Activities may be executing on different VMs, it is possible for those VMs to partially lose connectivity or slow down indefinitely. In the general case, it's impossible for the scheduling VM to know whether the work is simply taking too long to complete, or whether the VM executing the Activity lost connectivity and will never return a result. This is why, after some time, we re-schedule the Activity to ensure your workflow eventually completes.

> If I'm understanding you correctly, this means an activity function should always be able to produce the same value?

That's mostly correct. Note that this is not the same as the Activity Function being deterministic; it just means that once it has produced a value, future invocations of it should produce the same value. That's idempotency :-)
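
As a toy contrast between the two properties (all names here are made up): the Activity below is not deterministic, since a brand-new execution invents a new value, but it is idempotent, since once the first result has been persisted, every re-execution returns that same result:

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public static class AssignIdActivity
    {
        // Hypothetical key-value store; any durable storage would do.
        public interface IKeyValueStore
        {
            Task<string> TryGetAsync(string key);
            Task SaveAsync(string key, string value);
        }

        private static readonly IKeyValueStore IdStore = null; // placeholder wiring

        [FunctionName("AssignId")]
        public static async Task<string> Run([ActivityTrigger] string orderId)
        {
            // Idempotent: a previously stored value wins over generating a new one.
            string existing = await IdStore.TryGetAsync(orderId);
            if (existing != null)
            {
                return existing;
            }

            // Not deterministic: a fresh execution produces a new Guid.
            string fresh = Guid.NewGuid().ToString();
            await IdStore.SaveAsync(orderId, fresh);
            return fresh;
        }
    }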

In general, I agree designing for idempotency is difficult, but it can be done and the benefits are many! Do let me know if I can provide more guidance, thanks!

jonathanharr commented 2 years ago

I see! @davidmrdavid

So I'm guessing that if this were really crucial, the only way to get around it would be to, say, upgrade from a Consumption plan to the Premium plan, or something even higher if that exists? (Since I'm guessing you'd then have some sort of dedicated VM that ensures your Activity Function isn't lost in the pool of VMs.)

davidmrdavid commented 2 years ago

@DieselWeazel:

Hmm, I wish it were that simple, but no, upgrading plans will not help; the inability to fully observe the "health" of a remote procedure is a fundamental problem of distributed systems. I think the only way to truly avoid this would be to run everything on the same machine, at which point you're no longer taking advantage of distribution anymore :-) .

But let's take a step back. In general, while we do re-schedule Activities from time to time, this is not supposed to be a common occurrence. I have seen, anecdotally, many cases of customers who used Durable Functions for years without realizing Activities had an at-least-once delivery guarantee. In general, we minimize as much as possible the number of instances where we have to resort to duplicate Activity executions.

So with that in mind, it would be worthwhile to investigate whether duplicate Activity executions are actually the root cause of your issue. Like I mentioned, it would be good to know if you have any logs showing that this is the case :)

jonathanharr commented 2 years ago

We can close this one. It was just an idea I had. And your help @davidmrdavid has taught me some more lessons along the way.

There are no logs, by the way, but I never really suspected this was the root cause of my issue; I was just curious as to how this works. Thanks again for replying and giving some insight into how this all works at a bit of a deeper level. :)