Azure / Azure-Functions

1.12k stars 199 forks source link

Durable Entities keeps re-running task after successful completion #2337

Open sourcerist opened 1 year ago

sourcerist commented 1 year ago

I have a long running task that needs to only have one instance running at a time. I chose Azure Durable Entities based on the documentation that these are designed for this sort of situation, but it seems that past a certain threshold the entity will re-run the task after execution has completed. It does the job of single threading the task beautifully, but never seems to recognize that it has completed.

Here is an example with the Sleep call representing the long running task execution:

[FunctionName("LongRunningTask")]
public static void LongRunningTask([EntityTrigger] IDurableEntityContext context, ILogger log)
{
    var sleepTime = context.GetInput<TimeSpan>();
    var state = context.GetState<LongRunningTaskState>() ?? new LongRunningTaskState();
    log.LogInformation($"Waiting for {sleepTime}...  State when started: {JsonConvert.SerializeObject(state)}");
    System.Threading.Thread.Sleep(sleepTime);
    state.RunCount++;
    context.SetState(state);
    var updatedState = context.GetState<LongRunningTaskState>() ?? new LongRunningTaskState();
    log.LogInformation($"Finished waiting {sleepTime}.  New state is {JsonConvert.SerializeObject(updatedState)}");
    context.Return(state);
}

[FunctionName("StartLongRunningTask")]
public static async Task<IActionResult> StartLongRunningTask(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", "post", Route = "StartLongRunningTask/{seconds:int}")] HttpRequestMessage req,
    int seconds,
    [DurableClient] IDurableEntityClient starter,
    ILogger log)
{
    var delay = TimeSpan.FromSeconds(seconds);
    var entityId = new EntityId(nameof(LongRunningTask), "Singleton");
    await starter.SignalEntityAsync(entityId, "Download", operationInput: delay);
    return (ActionResult)(new OkObjectResult($"Schedule task for {delay}"));
}

If I tell it to wait 30 seconds, it behaves as expected:

[2023-02-03T23:02:22.452Z] Executing 'StartLongRunningTask' (Reason='This function was programmatically called via the host APIs.', Id=542d8dad-58a2-47a1-94ac-5f5489a56d67)
[2023-02-03T23:02:22.514Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' scheduled. Reason: EntitySignal:Download. IsReplay: False. State: Scheduled. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 2.
[2023-02-03T23:02:22.529Z] Executed 'StartLongRunningTask' (Succeeded, Id=542d8dad-58a2-47a1-94ac-5f5489a56d67, Duration=98ms)
[2023-02-03T23:02:22.660Z] Executing 'LongRunningTask' (Reason='(null)', Id=11e224a1-acc1-404e-b29f-7dd062e16e40)
[2023-02-03T23:02:22.666Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' started. IsReplay: False. Input: (216 bytes). State: Started. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 3. TaskEventId: -1
[2023-02-03T23:02:22.676Z] Waiting for 00:00:30...  State when started: {"runCount":1}
[2023-02-03T23:02:52.773Z] Finished waiting 00:00:30.  New state is {"runCount":2}
[2023-02-03T23:02:52.777Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed 'Download' operation 7470263c-de86-4243-84e5-eec14d489af0 in 30103.6145ms. IsReplay: False. Input: (216 bytes). Output: (56 bytes). HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 4.
[2023-02-03T23:02:52.809Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed. ContinuedAsNew: True. IsReplay: False. Output: (56 bytes). State: Completed. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 5. TaskEventId: -1
[2023-02-03T23:02:52.815Z] Executed 'LongRunningTask' (Succeeded, Id=11e224a1-acc1-404e-b29f-7dd062e16e40, Duration=30157ms)

But if I tell it to wait 10 minutes, it enters an endless loop where it doesn't recognize the successful completion of the previous iteration. Note the state at the beginning of each attempt stays at "runCount": 1:

[2023-02-03T22:29:41.355Z] Executing 'StartLongRunningTask' (Reason='This function was programmatically called via the host APIs.', Id=bd751b6f-71f8-4c1d-97ee-24b3027a6d85)
[2023-02-03T22:29:41.427Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' scheduled. Reason: EntitySignal:Download. IsReplay: False. State: Scheduled. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 2.
[2023-02-03T22:29:41.453Z] Executed 'StartLongRunningTask' (Succeeded, Id=bd751b6f-71f8-4c1d-97ee-24b3027a6d85, Duration=124ms)
[2023-02-03T22:29:41.474Z] Executing 'LongRunningTask' (Reason='(null)', Id=bdbb65aa-ac65-4252-bab8-028974a81148)
[2023-02-03T22:29:41.481Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' started. IsReplay: False. Input: (216 bytes). State: Started. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 3. TaskEventId: -1
[2023-02-03T22:29:41.500Z] Waiting for 00:10:00...  State when started: {"runCount":1}
[2023-02-03T22:39:41.586Z] Finished waiting 00:10:00.  New state is {"runCount":2}
[2023-02-03T22:39:41.592Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed 'Download' operation 8a07b5fb-fa93-4ff5-9278-2b33eb125ea1 in 600097.4966ms. IsReplay: False. Input: (216 bytes). Output: (56 bytes). HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 4.
[2023-02-03T22:39:41.616Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed. ContinuedAsNew: True. IsReplay: False. Output: (56 bytes). State: Completed. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 5. TaskEventId: -1
[2023-02-03T22:39:41.619Z] Executed 'LongRunningTask' (Succeeded, Id=bdbb65aa-ac65-4252-bab8-028974a81148, Duration=600148ms)
[2023-02-03T22:39:41.653Z] Executing 'LongRunningTask' (Reason='(null)', Id=18e7056a-22e2-496f-8486-a7704791713e)
[2023-02-03T22:39:41.654Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' started. IsReplay: False. Input: (216 bytes). State: Started. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 6. TaskEventId: -1
[2023-02-03T22:39:41.656Z] Waiting for 00:10:00...  State when started: {"runCount":1}
[2023-02-03T22:49:42.466Z] Finished waiting 00:10:00.  New state is {"runCount":2}
[2023-02-03T22:49:42.470Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed 'Download' operation 4e08a7d2-a0bf-4d68-bc2a-eff989ef3007 in 600814.92ms. IsReplay: False. Input: (216 bytes). Output: (56 bytes). HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 7.
[2023-02-03T22:49:42.473Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed. ContinuedAsNew: True. IsReplay: False. Output: (56 bytes). State: Completed. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 8. TaskEventId: -1
[2023-02-03T22:49:42.475Z] Executed 'LongRunningTask' (Succeeded, Id=18e7056a-22e2-496f-8486-a7704791713e, Duration=600822ms)
[2023-02-03T22:49:42.502Z] Executing 'LongRunningTask' (Reason='(null)', Id=d57f3470-af6f-464a-9b14-8091e70946c2)
[2023-02-03T22:49:42.503Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' started. IsReplay: False. Input: (216 bytes). State: Started. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 9. TaskEventId: -1
[2023-02-03T22:49:42.505Z] Waiting for 00:10:00...  State when started: {"runCount":1}
[2023-02-03T22:59:42.567Z] Finished waiting 00:10:00.  New state is {"runCount":2}
[2023-02-03T22:59:42.570Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed 'Download' operation 8a07b5fb-fa93-4ff5-9278-2b33eb125ea1 in 600065.2645ms. IsReplay: False. Input: (216 bytes). Output: (56 bytes). HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 10.
[2023-02-03T22:59:42.571Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' completed. ContinuedAsNew: True. IsReplay: False. Output: (56 bytes). State: Completed. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 11. TaskEventId: -1
[2023-02-03T22:59:42.572Z] Executed 'LongRunningTask' (Succeeded, Id=d57f3470-af6f-464a-9b14-8091e70946c2, Duration=600070ms)
[2023-02-03T22:59:42.594Z] Executing 'LongRunningTask' (Reason='(null)', Id=94e639a1-c460-494e-b48c-3215253acf6f)
[2023-02-03T22:59:42.595Z] @longrunningtask@Singleton: Function 'longrunningtask (Entity)' started. IsReplay: False. Input: (216 bytes). State: Started. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.0. SequenceNumber: 12. TaskEventId: -1
[2023-02-03T22:59:42.596Z] Waiting for 00:10:00...  State when started: {"runCount":1}

I have the function timeout set to 4 hours in my host.json file and a premium app service plan that allows for longer running functions.

bhagyshricompany commented 1 year ago

thanks for update pls share allrepro steps

bhagyshricompany commented 1 year ago

pls share function name timestamp region etc .

sourcerist commented 1 year ago

EntityBugRepro.zip @bhagyshricompany here are the repro steps: 1.) open the attached EntityBugRepro.zip project in VS2022 2.) Run the project either with or without debugging 3.) Get the route prefix for the StartLongRunningTask function (in my case it was http://localhost:7084/api/StartLongRunningTask/) 4.) Tell the task to run for 30 seconds by going to e.g. http://localhost:7084/api/StartLongRunningTask/30 and note that the 'LongRunningTask' function executes once: image 5.) Tell the task to run for 10 minutes by going to e.g. http://localhost:7084/api/StartLongRunningTask/600 and note that it executes 'LongRunningTask' repeatedly. Note also that 'StartLongRunningTask' only executes once despite 'LongRunningTask' executing many times, and that "(Reason='(null)" appears in the logs on executions after the first: image

bhagyshricompany commented 1 year ago

thanks for input will check at our end and update you asap

sourcerist commented 1 year ago

@bhagyshricompany I reproduced it using a fresh install of Visual Studio 2022 Community Edition: image Here are the complete logs: durable_entity_logs.txt

sebastianburckhardt commented 1 year ago

I have found the cause of the bug. It appears the AzureStorage provider (the default storage provider backend) does not renew the pop receipts for the message that is being processed, so by the time the work item commits, it fails when it tries to delete the message. This then causes the whole thing to run again.

We will need to fix this by adding code to renew the pop receipt. Something similar may already be in place for actitivies, I believe.

sourcerist commented 1 year ago

@sebastianburckhardt that's great! Do you have a timeline for when it will be released?

sourcerist commented 1 year ago

@sebastianburckhardt I see you merged this 3 weeks ago, but the latest release is still from Feb 1. Can you let me know when a release with this fix will be available?

sebastianburckhardt commented 1 year ago

There have been delays in getting both of the required packages released. I believe @davidmrdavid is still working on releasing an update.

davidmrdavid commented 1 year ago

Yes, apologies for the delay. I expect these packages to release this week though, as the validation has succeeded.