Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Out of proc still limited to < 7d for timers #1910

Open drub0y opened 2 years ago

drub0y commented 2 years ago

We're working on a TypeScript (hence Node.js) based durable orchestration where we expected to be able to call createTimer with dates more than 7 days out. We've been doing this in C# ever since #1390 was completed, so we just expected it to work, but instead we ran into the following stack trace:

Function 'myFunction (Orchestrator)' failed with an error. Reason: System.ArgumentException: The Azure Storage provider supports a maximum of 6 days for time-based delays (Parameter 'fireAt')
[2021-08-06T22:15:51.743Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableOrchestrationContext.ThrowIfInvalidTimerLengthForStorageProvider(DateTime fireAt) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableOrchestrationContext.cs:line 397
[2021-08-06T22:15:52.069Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.InvokeAPIFromAction(AsyncAction action) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\OutOfProcOrchestrationShim.cs:line 125
[2021-08-06T22:15:52.166Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.ProcessAsyncActionsV1(AsyncAction[][] actions) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\OutOfProcOrchestrationShim.cs:line 249
[2021-08-06T22:15:52.175Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.ReplayOOProcOrchestration(AsyncAction[][] actions, SchemaVersion schema) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\OutOfProcOrchestrationShim.cs:line 225
[2021-08-06T22:15:52.200Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.ScheduleDurableTaskEvents(OrchestrationInvocationResult result) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\OutOfProcOrchestrationShim.cs:line 84
[2021-08-06T22:15:52.211Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.HandleDurableTaskReplay(OrchestrationInvocationResult executionJson) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\OutOfProcOrchestrationShim.cs:line 64
[2021-08-06T22:15:52.221Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskOrchestrationShim.TraceAndReplay(Object result, Exception ex) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\TaskOrchestrationShim.cs:line 233
[2021-08-06T22:15:52.229Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskOrchestrationShim.InvokeUserCodeAndHandleResults(RegisteredFunctionInfo orchestratorInfo, OrchestrationContext innerContext) in D:\a\r1\a\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\TaskOrchestrationShim.cs:line 155. IsReplay: False. State: Failed. HubName: MyTaskHub. AppName: . SlotName: . ExtensionVersion: 2.5.0. SequenceNumber: 13. TaskEventId: -1

This was, as you can imagine, shocking, but then I tracked it down to here:

https://github.com/Azure/azure-functions-durable-extension/blob/d8186d84c51a982929debf7b5b6186ca236d4822/src/WebJobs.Extensions.DurableTask/Listener/OutOfProcOrchestrationShim.cs#L120-L123

Which calls here:

https://github.com/Azure/azure-functions-durable-extension/blob/821b862f534684d8ffab5e58cddeb94b563e6c9b/src/WebJobs.Extensions.DurableTask/ContextImplementations/DurableOrchestrationContext.cs#L387-L397

The comment is pretty clear about this being here for possible "out of proc" issues, but... is there really an issue, or was it just forgotten about? I'm actually amazed that it's been nearly a year since #1390 and nobody else has run into this, or at least hasn't reported it.

cgillum commented 2 years ago

Pinging @davidmrdavid and @bachuv for this one. It's been a while, so I don't quite remember what the limitations for OOProc were at the time, or whether they still apply.

davidmrdavid commented 2 years ago

Hi @drub0y,

Thanks for reaching out! Indeed, we've known about this limitation for some time now. If I recall correctly, long-running timers are handled by the durable-extension and durable-task framework by splitting them into a sequence of smaller timers. I believe this trick made it particularly hard to support long-running timers in our current out-of-process orchestration replay algorithm, as it required careful reasoning about timer-fired events in our history array. As a result, we decided to prioritize revamping our replay algorithm first, which should make it easier to incorporate this feature, as well as provide some long-needed correctness and performance improvements to the SDK.

For the past few months, we've been working on doing just that, and we're nearing the release of a revamped replay algorithm for our out-of-proc SDKs. After those are out, we should be unblocked to resume work on this feature. In the meantime, I believe that manually sequencing timers should allow for a long-running timer-like experience.
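
For illustration, the manual sequencing mentioned above could look roughly like the following Python sketch (the same idea carries over to the Node.js SDK). Here create_long_timer and max_days are illustrative names, not SDK APIs; the helper is meant to be driven with yield from inside an orchestrator:

from datetime import datetime, timedelta

def create_long_timer(context, fire_at: datetime, max_days: int = 6):
    # Emulate one long timer by chaining sub-timers, each of which stays
    # within the Azure Storage provider's 6-day limit. The orchestration
    # simply wakes up and schedules the next sub-timer until fire_at passes.
    while context.current_utc_datetime < fire_at:
        next_fire = min(
            fire_at,
            context.current_utc_datetime + timedelta(days=max_days),
        )
        yield context.create_timer(next_fire)

An orchestrator would then do something like yield from create_long_timer(context, context.current_utc_datetime + timedelta(days=30)) instead of a single create_timer call with a far-out date.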

drub0y commented 2 years ago

Well, I'm glad to hear it's being tracked, but it's an unfortunate discovery for us in the meantime as we'll have to go back to writing our own timeout loops into our orchestrations now. 😞

ConnorMcMahon commented 2 years ago

As a note, we have updated our docs in the meantime to reflect that long-running timers currently only work in C#.

wesleysmitthe commented 1 year ago

Any update on long-running timer support in Node or Python?

davidmrdavid commented 1 year ago

Hi @wesleysmitthe. I believe NodeJS already supports this (@hossam-nasr, can you please confirm that the currently released extension supports this?).

The work in Python has not been prioritized yet.

danielniccoli commented 1 month ago

I'm running into the same issue (Durable Functions / Python). A maximum of 6 days for WaitForExternalEvent is not enough for our use cases; we need up to 30 days. In #1538 it was said that this is just a bug and that longer timeouts are possible. This issue suggests that it is not a bug. I'm confused, but also unable to continue my work due to this limitation.

@davidmrdavid What is the status of this?

davidmrdavid commented 1 month ago

Hey @danielniccoli: this is definitely implemented at the Durable Extension level (this repo), but it's possible there's a gap in Python. Can you tell me how you're pairing timers and external events? As far as I know, the external event API in Python does not support a timer parameter (https://github.com/Azure/azure-functions-durable-python/blob/3f0e168b9d5b0b3ea06de878586aa6436dd6a3a1/azure/durable_functions/models/DurableOrchestrationContext.py#L572), so it sounds to me like you're scheduling an explicit timer in addition to the external event? A small code snippet would help.

davidmrdavid commented 1 month ago

In the meantime, a manual workaround is always to split your timer into several smaller but sequential ones.

danielniccoli commented 1 month ago

@davidmrdavid

As far as I know, the external API in Python does not support a timer parameter

It does, see Timers in Durable Functions (Azure Functions).

this is definitely implemented at the Durable Extension level (this repo)

I don't know enough to make an assessment, but doesn't the error indicate that it's a problem with the underlying code?

System.ArgumentException: The Azure Storage provider supports a maximum of 6 days for time-based delays (Parameter 'fireAt') does not look like a gap in Python, but rather in some C# code.

Below is the code snippet. We're using external events for an approval flow. At a certain point, the orchestration waits for approval from two people, the owner and the deputy. If they do not approve within 21 days, the request is considered "declined" and the orchestration will exit.

...

due_date = context.current_utc_datetime + timedelta(days=21)
context.set_custom_status(
    {
        "status": "Waiting for approval",
        "details": {"dueDate": due_date.isoformat()},
    }
)
due_date_task = context.create_timer(due_date)
owner_approval_event_task = context.wait_for_external_event("OwnerApprovalEvent")
deputy_approval_event_task = context.wait_for_external_event("DeputyApprovalEvent")
owner_approval_winning_task = yield context.task_any(
    [owner_approval_event_task, due_date_task]
)
deputy_approval_winning_task = yield context.task_any(
    [deputy_approval_event_task, due_date_task]
)

if (
    owner_approval_winning_task == due_date_task
    or deputy_approval_winning_task == due_date_task
):
    # TODO: Not all approvals have been given before deadline
    # TODO: Complete the orchestration with status: unapproved
    context.set_custom_status(
        {
            "status": "Declined",
            "details": "Approvals were not given",
        }
    )
    return

...

Splitting it into several sequential timers could be done, but it's not something I'd like to do. I consider workarounds to be something that gets resolved within a certain time period. Since this ticket has existed for almost 3 years, the workaround feels more like the definitive solution 😬

davidmrdavid commented 1 month ago

Hi @danielniccoli,

I took a look and I see the issue now - the Python SDK is not yet onboarded to the protocol that allows for automatic long timer creation. The basic idea for how to implement that is in this PR (https://github.com/Azure/azure-functions-durable-js/pull/340), which is moderately short, but it is not something we have the bandwidth to immediately prioritize relative to other work. In the meantime, I would recommend continuing to use the workaround. I apologize that this isn't implemented yet; I realize it is frustrating on your end. My next ~3 weeks do seem rather busy, but if you ping me again after those 3 weeks, I may have sufficient bandwidth to re-prioritize this. Alternatively, we do routinely accept open-source contributions, in case the JS-based PR above is sufficient guidance (the two codebases are purposely almost identical).
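
For illustration, the workaround applied to an approval-style wait could look roughly like the following sketch; wait_for_event_or_deadline and MAX_TIMER_DAYS are hypothetical names (not SDK APIs), and the helper is meant to be driven with yield from inside the orchestrator:

from datetime import datetime, timedelta

MAX_TIMER_DAYS = 6  # per-timer ceiling imposed by the Azure Storage provider

def wait_for_event_or_deadline(context, event_name: str, deadline: datetime):
    # Race an external event against a deadline more than 6 days out by
    # splitting the wait into a sequence of shorter sub-timers.
    event_task = context.wait_for_external_event(event_name)
    while context.current_utc_datetime < deadline:
        fire_at = min(
            deadline,
            context.current_utc_datetime + timedelta(days=MAX_TIMER_DAYS),
        )
        timer_task = context.create_timer(fire_at)
        winner = yield context.task_any([event_task, timer_task])
        if winner == event_task:
            timer_task.cancel()  # stop the pending sub-timer
            return True          # the event arrived before the deadline
    return False                 # the deadline passed without the event

In the orchestrator, the 21-day approval window would then become something like owner_approved = yield from wait_for_event_or_deadline(context, "OwnerApprovalEvent", due_date). This is only a sketch: in particular, re-yielding the same external-event task across loop iterations should be verified against the SDK version in use.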