Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Allow custom polling behavior in CallHttpAsync() for long-running endpoints #1183

Open idg10 opened 4 years ago

idg10 commented 4 years ago

The 202-based polling pattern that Durable Functions supports (both for consuming external HTTP endpoints and for presenting the status of in-progress Durable Functions orchestrations) is at odds with the HTTP specification, and also does not conform to Microsoft's guidelines at https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md

We would like Durable Functions to be able to consume long-running operations that work in the way described in those guidelines. Currently, Durable Functions requires an operation status endpoint to return a 202 status for as long as the operation is running. If the operation status endpoint returns a 200 status, Durable Functions interprets this as meaning that the operation is complete.

That is directly incompatible with section 13.2.5 of Microsoft's API guidelines at https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md#1325-operation-resource, which says:

The GET operation against an operation MUST return:

  1. The operation resource, its state, and any extended state relevant to the particular API.
  2. 200 OK as the response code.

The reason we'd prefer to have long-running endpoints work the way that these guidelines recommend is that they are consistent with the HTTP specification in a way that the approach taken by Durable Functions is not.

Durable Functions expects a GET against the URL representing an operation to return a 202 if the operation is incomplete. The problem with this, from an HTTP specification point of view, is that the meaning of a 202 status at this point is not "the operation hasn't finished", it is "we haven't finished retrieving the operation's status". That's not how Durable Functions interprets it, of course, but it is what a 202 status at this point implies based on what the HTTP spec says.

Where a 202 does come into this is as the return code from the endpoint that starts the operation. In that context it makes sense, because the HTTP operation it describes is typically a POST asking to do a thing, and the 202 correctly indicates that the work to do the thing is underway but not complete, and the response can include a header saying where to find out more. (Opinion is divided on exactly what that header should be. Azure uses either of two different headers for this, as described at https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-async-operations#monitor-status-of-operation, while the Microsoft guidelines show a non-standard Operation-Location header in https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md#1327-the-typical-flow-polling, and there's a proposal at https://github.com/Microsoft/api-guidelines/pull/101 to use the standard Content-Location instead; Durable Functions expects Location.)

HTTP status codes tell you about the HTTP request at hand. They are not designed to tell you about the status of something else. The reason the API guidelines say that the operation endpoint MUST return a 200 (regardless of whether the underlying operation is still running, has completed, or has failed) is that this is how it tells the client that the GET request to fetch the status has been handled successfully.

With 202s, this is arguably not a problem (assuming you're prepared to ignore the abuse of the HTTP protocol), but where the problems in this design really become apparent is on failure. If you look at https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-api#get-instance-status, it says that the way the orchestration instance endpoint reports the failure of the instance is by producing a 500 status.

The big problem with that is that there is no way to distinguish between the failure of the orchestration instance and the failure of the server to report the status of the instance.

That's a critical distinction. If a client tries to learn the status of an orchestration instance and receives a 500 status code, it doesn't know anything about the status of the instance, because there is no way to distinguish between failure of the orchestration instance and failure of the attempt to discover that instance's status. But if the request to fetch the status were to return a 200 and provide a body reporting that the instance has failed, the client would know exactly what the status of the instance is. The appropriate recovery/reporting action for these two failure modes is completely different. (If the attempt to discover the status failed due to a server error, it's worth waiting and retrying, because 500 errors are often transient. But if the instance itself has failed, there's no point retrying the status request, because it's going to carry on telling you the same thing.) With the current design, you can't tell which of these has happened when you get a 500.

This is exactly why you mustn't use the HTTP status to report anything other than how the particular request being responded to was handled. As soon as you start trying to merge other information into the HTTP status response (e.g., information about how some other operation is going) you introduce ambiguity.

The canonical HTTP mechanism for doing this is to use the status code purely to describe the request handling, and then to put all other state of interest into the response body (or possibly response headers). And that's exactly what the Microsoft API guidelines say to do.
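
For concreteness, here is a minimal sketch of a status endpoint that follows this pattern, written as an Azure Functions HTTP trigger; the route, the status values, and the LookUpOperation helper are all illustrative rather than anything prescribed by the guidelines:

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class OperationStatusEndpoint
{
    // Always 200: the GET to fetch the status succeeded. The operation's own
    // state (running, succeeded, failed...) travels in the body instead.
    [FunctionName("GetOperationStatus")]
    public static IActionResult Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "operations/{id}")] HttpRequest req,
        string id)
    {
        // LookUpOperation is a placeholder for however operation state is stored.
        var operation = LookUpOperation(id);

        return new OkObjectResult(new
        {
            id,
            status = operation.Status, // e.g. "running", "succeeded", "failed"
            error = operation.Error    // populated only when status == "failed"
        });
    }

    private static (string Status, string Error) LookUpOperation(string id) =>
        ("running", null); // stub for illustration
}
```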

I understand that Durable Functions is now committed to this design for the server side of this pattern, because it's already out there. So much as I'd like to see it fixed, I understand you've got customers depending on it. (Although you could still introduce a new orchestration instance endpoint that complies with the HTTP spec and gradually deprecate the old one.)

But I really want to implement my own APIs in a way that complies with the HTTP spec (which means I must be able to produce a 200 status code from the endpoint reporting an operation's status, regardless of whether that operation is complete, in progress, or has failed). And ideally I'd be able to take advantage of Durable Functions' ability to consume long-running operations in order to monitor operations that report their status in a way that is compliant with the HTTP specification and with Microsoft's own recommendations.

So even if you can't change how Functions' implementation of the server side of this pattern works, you could modify your implementation of the client side of the pattern so that it's possible to consume HTTP-compliant status endpoints.

One way to do this would be to open up the logic that looks at the response from a status endpoint and decides whether it's done. Currently that logic is baked into DurableOrchestrationContext's implementation of IDurableOrchestrationContext.CallHttpAsync at https://github.com/Azure/azure-functions-durable-extension/blob/v2/src/WebJobs.Extensions.DurableTask/ContextImplementations/DurableOrchestrationContext.cs#L233

If this logic could be factored out, with an option to plug in some other test (e.g., something conceptually equivalent to Func<HttpResponseMessage, Task<bool>>, although the reality might be more complex) I could then plug in a test which, rather than inferring the operation status from the HTTP status code, looked at the body and retrieved the status property from the JSON in the response (i.e., using the mechanism described in Microsoft's API guidelines).
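
For illustration, here is a rough sketch of what such a pluggable test might look like, assuming a Func<HttpResponseMessage, Task<bool>> hook and guideline-style "status" values in the response body; nothing here is an existing extension point:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class GuidelinesCompliantPollingCheck
{
    // Hypothetical pluggable test: a 200 means "the status fetch succeeded",
    // and the operation's real state is read from the "status" property of the
    // JSON body, as the Microsoft API guidelines describe.
    public static readonly Func<HttpResponseMessage, Task<bool>> IsOperationComplete =
        async response =>
        {
            if (!response.IsSuccessStatusCode)
            {
                // The status fetch itself failed; this says nothing about the
                // operation, so keep polling (or retry).
                return false;
            }

            string json = await response.Content.ReadAsStringAsync();
            string status = JObject.Parse(json).Value<string>("status");

            // Only terminal states mean we can stop polling.
            return status == "succeeded" || status == "failed" || status == "canceled";
        };
}
```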

ConnorMcMahon commented 4 years ago

@idg10

Thank you for the feedback. I think you bring up a lot of very valid points. To clarify, while our design does not match the Microsoft API guidelines you linked to, we did follow the asynchronous request-reply pattern mentioned in other Microsoft documentation.

At the end of the day, because there is no true universal standard for how to handle the asynchronous pattern, we went with the approach on our client-side implementation that was compatible with our server-side implementation, to allow coordination across different Durable Functions applications. It is very possible that there were better alternatives, but at this point we are a bit stuck on what the default behavior is.

I definitely think allowing some callback function to decide retry behavior is an excellent way to allow the best of both worlds. We should definitely revisit the defaults in the future when we decide to do another major release, but it will likely be a long time before we do that again.

As a side-note, the documentation regarding our GET orchestration status endpoint is actually incorrect for Durable v2.x (I'll open a separate issue to get that fixed). By default, we now return a 200 status code on orchestration failures. In order to restore the v1.x behavior, the URL needs to include the query string parameter returnInternalServerErrorOnFailure=true. You can see the code for that here.

markheath commented 4 years ago

I think there is scope for several improvements to the way CallHttpAsync works for long-running operations.

First, there is no support for retries. You'd probably want separate retry policy options for the "initial" call to the endpoint and for the subsequent status-polling calls. This could be supported with two optional RetryOptions properties on the DurableHttpRequest.

Second, it would be nice to be able to specify an overall maximum polling duration, after which it could throw a timeout exception.

Third, from my experiments it seems that the original HTTP request will time out in just under 90 seconds. It would be nice to be able to control that timeout duration with another property on DurableHttpRequest.

Fourth, it seems a bit odd that the polling endpoint has to keep specifying the Location header. Could there be an option to always poll the URL from the Location header returned by the original request? Or even just to default to using that if the polling endpoint's response doesn't include a Location header.

Fifth, as @idg10 has pointed out, the polling pattern is a little idiosyncratic and doesn't necessarily match how many long-running APIs work in the wild. Maybe DurableHttpRequest could take a func that inspects the polling response and returns whether to keep polling or not.

Finally, the ManagedIdentityTokenSource that you can provide doesn't let you override the azureAdInstance constructor parameter for the underlying AzureServiceTokenProvider, which prevents you from using it against Azure Government.

I'd be interested to hear any feedback on these suggestions, as we'd like to use this feature for some of our workflows and might be able to offer a PR with these changes in.
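
To make the shape of these suggestions concrete, here is a purely hypothetical sketch of the kinds of options being asked for, shown as a standalone class rather than the real DurableHttpRequest; none of these members exist today and the names are illustrative only:

```csharp
using System;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

// Hypothetical options -- not the actual DurableHttpRequest API.
public class DurableHttpRequestOptionsSketch
{
    // 1. Separate retry policies for the initial call and for status polling.
    public RetryOptions InitialCallRetry { get; set; }
    public RetryOptions PollingRetry { get; set; }

    // 2. Give up polling altogether after this much time.
    public TimeSpan? MaximumPollingDuration { get; set; }

    // 3. Override the ~90 second timeout observed on the original HTTP request.
    public TimeSpan? HttpRequestTimeout { get; set; }

    // 4. Fall back to the Location header from the original 202 response when a
    //    polling response doesn't supply one.
    public bool ReuseInitialLocationHeader { get; set; }

    // 5. Let the caller inspect each polling response and decide whether to continue.
    public Func<DurableHttpResponse, bool> ContinuePolling { get; set; }
}
```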

bachuv commented 4 years ago

Below are some possible modifications we can make to incorporate custom polling into the CallHttpAsync API. These modifications would allow customers to provide a Func<T, TResult> delegate to the CallHttpAsync API.

There will be a new type called PollingDecision with two subtypes:

  1. PollAgain subtype: DurableHttpRequest
  2. DonePolling subtype: Empty Object

Func<T, TResult> delegate signature for the CallHttpAsync API: Func<DurableHttpResponse, PollingDecision> (as used in the modified API signatures below).

This delegate signature would be simple to use and to implement. It would also allow synchronous operations, because the delegate would be invoked inside an orchestrator function.

Modifications to the two existing CallHttpAsync APIs:

  1. Task<DurableHttpResponse> CallHttpAsync(DurableHttpRequest req, Func<DurableHttpResponse, PollingDecision> pollingDecider = null)
  2. Task<DurableHttpResponse> CallHttpAsync(HttpMethod method, Uri uri, string content = null, Func<DurableHttpResponse, PollingDecision> funcName = null)

The Func<DurableHttpResponse, PollingDecision> delegate will allow customers to implement their own polling logic.
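
Here is a rough sketch of how the proposed delegate might be used, with assumed shapes for the hypothetical PollingDecision, PollAgain, and DonePolling types (only DurableHttpRequest and DurableHttpResponse are existing types; the status values are illustrative):

```csharp
using System;
using System.Net.Http;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json.Linq;

// Assumed shapes for the proposed types (not an existing API).
public abstract class PollingDecision { }
public sealed class DonePolling : PollingDecision { }
public sealed class PollAgain : PollingDecision
{
    public PollAgain(DurableHttpRequest nextRequest) => NextRequest = nextRequest;
    public DurableHttpRequest NextRequest { get; }
}

public static class CustomPollingSketch
{
    // A polling decider that reads the operation state from the JSON body of a
    // 200 response, per the Microsoft API guidelines, instead of relying on the
    // HTTP status code.
    public static Func<DurableHttpResponse, PollingDecision> FromStatusProperty(Uri statusEndpoint)
    {
        return response =>
        {
            string status = JObject.Parse(response.Content).Value<string>("status");

            if (status == "succeeded" || status == "failed" || status == "canceled")
            {
                return new DonePolling();
            }

            // Not finished yet: ask the framework to poll the operation resource again.
            return new PollAgain(new DurableHttpRequest(HttpMethod.Get, statusEndpoint));
        };
    }

    // Usage inside an orchestrator, per proposed overload 1 above:
    //   DurableHttpResponse result = await context.CallHttpAsync(
    //       request, CustomPollingSketch.FromStatusProperty(statusUri));
}
```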

Also, two new issues were created to address polling durations (issue #1278) and configuring ManagedIdentityTokenSource to work with government clouds (issue #1279).

MedAnd commented 4 years ago

I've been doing a deep dive this week with Durable Functions to validate their readiness for large-scale e-commerce workloads, and I came across what others have mentioned above while dissecting the Human interaction in Durable Functions - Phone verification sample.

Whilst we can create our own Polly based retry and timeout wrappers around DurableClient GetStatusAsync, I'm wondering why not support various input bindings also for DurableOrchestrationContext WaitForExternalEvent, for true pub / sub? For example WaitForExternalEvent could trigger on Azure Service Bus, CosmosDB etc as opposed to having a function trigger first, which then has to call DurableClient RaiseEventAsync.

The other big wish-list item is better handling of scenarios where the caller is latency-sensitive and requires more synchronous results. In the Human interaction in Durable Functions - Phone verification sample again, when a user enters a response to a 2FA challenge, the UI (calling application) either has to continually poll an HttpTrigger function which in turn uses DurableClient GetStatusAsync, or the HttpTrigger function has to poll internally via DurableClient GetStatusAsync, assuming the orchestration completes in under 90 seconds and the calling application does not bail on the HttpTrigger function. Both options feel like resource-efficiency anti-patterns.

In addition, it's not very clear from existing samples and documentation what the best practices are around long-running orchestrations and progress state. External calling functions may need to report on and correlate progress. Should this progress state be persisted to a Durable Entity, maybe in Cosmos DB, etc.? Moreover, for scenarios where the calling application wants more immediate feedback as opposed to polling the orchestration, would it be more efficient to use the Azure Functions bindings for SignalR Service?

@cgillum @jeffhollan @anthonychu & team - thoughts on above much appreciated!

cgillum commented 4 years ago

In addition, it's not very clear from existing samples and documentation what the best practices are around long-running orchestrations and progress state. External calling functions may need to report on and correlate progress.

@MedAnd I suggest you take a look at our Custom orchestration status documentation. It's very much designed for projecting state to external callers. It would be much simpler than trying to post status to a separate entity and making the caller discover and poll it. The existing orchestration status API will return this custom status data back to the caller, who can use it to take some action if necessary.
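
For example, here is a minimal sketch (the function and activity names and the progress payloads are made up) of an orchestrator projecting progress via the custom status API:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ProgressReportingOrchestration
{
    [FunctionName("ProcessOrderOrchestrator")]
    public static async Task RunAsync([OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // The custom status is surfaced to callers via the built-in status
        // query endpoint and DurableClient GetStatusAsync, with no extra plumbing.
        context.SetCustomStatus(new { phase = "validating", percentComplete = 10 });
        await context.CallActivityAsync("ValidateOrder", context.GetInput<string>());

        context.SetCustomStatus(new { phase = "charging", percentComplete = 60 });
        await context.CallActivityAsync("ChargePayment", context.GetInput<string>());

        context.SetCustomStatus(new { phase = "complete", percentComplete = 100 });
    }
}
```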

The other big wish-list item is better handling of scenarios where the caller is latency-sensitive and requires more synchronous results. In the Human interaction in Durable Functions - Phone verification sample again, when a user enters a response to a 2FA challenge, the UI (calling application) either has to continually poll an HttpTrigger function which in turn uses DurableClient GetStatusAsync, or the HttpTrigger function has to poll internally via DurableClient GetStatusAsync, assuming the orchestration completes in under 90 seconds and the calling application does not bail on the HttpTrigger function. Both options feel like resource-efficiency anti-patterns.

If you're concerned about the scalability and latency implications of polling, then I would suggest looking into a push-based notification system using SignalR or some similar technology. The orchestration could push notifications to clients in a way that is more real-time and that doesn't put pressure on your storage account. I'm considering whether we can do work to make integration between Durable and SignalR more native, but until then it's possible to wire these things together yourself.
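
As a rough sketch of that wiring (the hub name, target, and function names are illustrative, not part of any built-in integration), an activity function can push progress to clients through the SignalR Service output binding:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.SignalRService;

public static class PushProgressActivity
{
    // The orchestrator calls this activity whenever it wants to notify clients,
    // instead of making them poll the status endpoint.
    [FunctionName("NotifyProgress")]
    public static Task Run(
        [ActivityTrigger] string progressMessage,
        [SignalR(HubName = "orchestrationProgress")] IAsyncCollector<SignalRMessage> signalRMessages)
    {
        return signalRMessages.AddAsync(new SignalRMessage
        {
            Target = "progressUpdated",
            Arguments = new object[] { progressMessage }
        });
    }
}
```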

I'm wondering why not support various input bindings also for DurableOrchestrationContext WaitForExternalEvent, for true pub / sub? For example WaitForExternalEvent could trigger on Azure Service Bus, CosmosDB etc as opposed to having a function trigger first, which then has to call DurableClient RaiseEventAsync.

Ultimately all messages that get delivered to an orchestration need to be collected into a common location. In this case, that common location is the control queues in Azure Storage. We could listen to Service Bus queues and topics, but we'd still need to transfer those messages into Azure Storage in order for orchestration message processing to work correctly in the backend. We could make it built-in like we do for our built-in HTTP APIs, but it requires us to be opinionated about how we correlate messages from external stores to specific orchestration instances. We elected to just allow users to do this correlation themselves using external triggers and then use the Durable Client to forward the correct message to the correct instance. It's a little more code, but it's quite flexible.
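
As a rough sketch of that correlation pattern (the queue name, event name, connection settings, and the assumption that the message body carries the target instance ID are all illustrative):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json.Linq;

public static class ExternalEventBridge
{
    // A plain Service Bus-triggered function does the correlation and forwards
    // the payload to the right orchestration instance via the Durable Client.
    [FunctionName("ForwardApprovalEvent")]
    public static Task ForwardAsync(
        [ServiceBusTrigger("approvals")] string message,
        [DurableClient] IDurableOrchestrationClient client)
    {
        // Assumed message shape: { "instanceId": "...", "result": "..." }
        JObject body = JObject.Parse(message);
        return client.RaiseEventAsync(body.Value<string>("instanceId"), "Approval", body.Value<string>("result"));
    }

    [FunctionName("WaitForApprovalOrchestrator")]
    public static async Task<string> RunAsync([OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // The orchestrator just waits for the forwarded event.
        return await context.WaitForExternalEvent<string>("Approval");
    }
}
```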

ConnorMcMahon commented 4 years ago

@MedAnd

Thanks for the feedback. I see Chris already began to address many of your concerns. If you have further thoughts or questions around this, I would recommend that you open a new issue and link to that here, just so that we can keep this issue focused on the custom polling behavior of our API, and we can have a more focused discussion around your proposals there.

iMicknl commented 4 years ago

I just faced the same issue when trying to implement the 202 handling via call_http in Azure Durable Functions for Python. It took me a while before I figured out that Cognitive Services are using the Operation-Location header and that call_http is expecting a Location header.

It would be great if there were a way to specify the header used for 202 retry handling when you use call_http. Is my current scenario blocked, or is there already a way to overcome this limitation?

https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-features?tabs=python#http-202-handling

idg10 commented 2 years ago

The comments on PR #1918 suggest that it fixes this issue.

I don't think it does, for the reasons I describe at https://github.com/Azure/azure-functions-durable-extension/pull/1918#issuecomment-950605165