Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Durable orchestrations take a long time to start their activities. #2813

Closed well0549 closed 1 month ago

well0549 commented 1 month ago

Description

I have 12 durable functions that process messages from a queue. The process is quite simple.

It's all Consumption plan, .NET 8, isolated worker.

The problem is that the messages get consumed by the queue trigger and put in the task hub Instances table with the status Pending. Some of them stay in that state for about 8 to 10 minutes; nothing happens.
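For illustration, here is a minimal sketch of the pattern described above, assuming the Microsoft.Azure.Functions.Worker.Extensions.DurableTask worker package; the function, queue, and connection names are hypothetical:

```csharp
// Minimal sketch (hypothetical names): a Service Bus trigger schedules a durable
// orchestration, which then calls an activity. The orchestration shows up in the
// task hub Instances table as "Pending" until a worker dequeues its start message.
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public class CleanupFunctions
{
    [Function(nameof(QueueStarter))]
    public static async Task QueueStarter(
        [ServiceBusTrigger("cleanup-queue", Connection = "ServiceBusConnection")] string message,
        [DurableClient] DurableTaskClient client)
    {
        // Creates the orchestration instance; its status stays "Pending" until it is picked up.
        await client.ScheduleNewOrchestrationInstanceAsync(nameof(CleanupOrchestrator), message);
    }

    [Function(nameof(CleanupOrchestrator))]
    public static async Task CleanupOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        string? input = context.GetInput<string>();
        await context.CallActivityAsync(nameof(CleanupActivity), input);
    }

    [Function(nameof(CleanupActivity))]
    public static Task CleanupActivity([ActivityTrigger] string input)
    {
        // The actual cleanup work would go here.
        return Task.CompletedTask;
    }
}
```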

Expected behavior

If a message is put on a queue to be processed by a durable orchestration, the function app could be asleep, so it could take some time to pick up the message. But once the message is consumed from the queue, the orchestration should kick in immediately and start processing its activities.

Compared to the current .NET 6.0 (non-isolated) solution, the processing is very slow.

Actual behavior

The orchestration is created and a row is inserted into the task hub Instances table with the status Pending. It can take up to 10 minutes before the instance is picked up for processing.

Known workarounds

none

App Details

<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.18.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.3-preview1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.2.1" />

I also have a question about this here: https://github.com/Azure/azure-functions-durable-extension/discussions/2808

But I think this is a bug because it behaves so much worse than .NET 6.0.

well0549 commented 1 month ago

I had a look at various diagnostics, and I think it has to do with the algorithm that is in place when the function scales down. I think this is different from standard (non-isolated) functions. When the instances scale up, everything is fine, but when they scale down I get a pause of 5 minutes in the process. This can also happen on .NET 6.0 with non-isolated functions, but the behaviour there is much smoother and I get far fewer 5-minute pauses.

It seems the scale controller scales down while the instance is still picking up a lot of messages from the queue and putting them in table storage with status Pending. The pending instances then have to wait for five minutes before they are served. That would be logical for messages that are actually being processed (only a message that is being processed should hold a lock), but Pending means waiting to be processed. Maybe the underlying mechanism should release the locks on pending messages before killing the process?

When I put in a bunch of messages, about 60 percent of them finish fast and 30 percent take a long time to complete.

Maybe there is a KQL query I could run to get the information you need.

well0549 commented 1 month ago

I am willing to give the developers who will work on this access to my account so they can trigger the functions and test for themselves.

davidmrdavid commented 1 month ago

Hi @well0549 - can you please confirm which DurableTask package versions you're using? It's interesting that you mentioned the scale controller, as I was just looking at another thread that made a connection between .NET isolated and scale controller issues (https://github.com/microsoft/durabletask-netherite/issues/398), though that was in regard to failures. Still, I'm trying to see if there's a connection, or if it's a coincidence.

On our end, what we usually need in order to investigate is:

A time range in UTC when the issue was observed on Azure, the instance ID of the affected orchestrator, and the name of the app.

well0549 commented 1 month ago

It's all C#, .NET 8.0, running on a Windows Consumption plan.

        <FrameworkReference Include="Microsoft.AspNetCore.App" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.1.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.2.1" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.18.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.3-preview1" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="1.2.0" />
        <PackageReference Include="Microsoft.ApplicationInsights.WorkerService" Version="2.22.0" />

I am currently setting up Application Insights again and will provide the details soon.

well0549 commented 1 month ago

One other thing I noticed is that the linked issue also assigns its own instance ID in the call to ScheduleNewOrchestrationInstanceAsync, just like I do.
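For reference, a minimal sketch of what supplying your own instance ID looks like with the isolated-worker client, assuming the StartOrchestrationOptions parameter of ScheduleNewOrchestrationInstanceAsync; the orchestrator name and ID scheme are hypothetical:

```csharp
// Hedged sketch: scheduling an orchestration with a caller-supplied instance ID.
// "CleanupOrchestrator" and the ID scheme below are hypothetical.
using System;
using System.Threading.Tasks;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public static class ExplicitInstanceIdStarter
{
    public static async Task<string> StartAsync(DurableTaskClient client, string message)
    {
        string instanceId = Guid.NewGuid().ToString("N");

        // StartOrchestrationOptions lets the caller choose the instance ID
        // instead of letting the framework generate one.
        await client.ScheduleNewOrchestrationInstanceAsync(
            "CleanupOrchestrator",
            message,
            new StartOrchestrationOptions { InstanceId = instanceId });

        return instanceId;
    }
}
```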

well0549 commented 1 month ago

@davidmrdavid: regarding your question about the DurableTask version, there is no reference to DurableTask anywhere. The references in my project are shown in the screenshot.

well0549 commented 1 month ago

OK, I have deployed everything again. I stopped the function apps because the Application Insights logging fills up my subscription quite fast.

Orchestration Instance Id = 8e6ada22112f43108bc079f7e752dbfd

Below is an example of a flow that executed as expected (screenshot): the message is written to the Service Bus, the orchestration is created, and the activities start almost immediately.

This is an example where the delay is visible (screenshot):

The message is written to the Service Bus, the trigger creates the orchestration, and then the first activity starts after 5 minutes. The name of the function is eig-eigdemo-cleanup.

well0549 commented 1 month ago

Here is a sample of the timeline; the initial delay of 5 minutes is visible there (screenshot).

well0549 commented 1 month ago

Start (screenshot).

It reports that it is starting execution (screenshot).

Then there is a 5-minute pause, and processing starts (screenshot).

well0549 commented 1 month ago

This is also visible in the storage account of the function (screenshot).

well0549 commented 1 month ago

Some more info (screenshot).

And a happy flow (screenshot).

well0549 commented 1 month ago

@davidmrdavid it seems the message is picked up twice at exactly the same moment (dequeue count = 2)? Then it waits until the lock expires before it starts processing?
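One way to check the dequeue count directly, assuming the default Azure Storage backend and the Azure.Storage.Queues SDK (an extra package, not referenced above), is to peek the task hub's queue messages without dequeuing them; the task hub name below is hypothetical:

```csharp
// Hedged diagnostic sketch using the Azure.Storage.Queues SDK. The Azure Storage
// backend names its queues "<taskhub>-workitems" and "<taskhub>-control-00".."-03";
// the queue name below is hypothetical.
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

var queueClient = new QueueClient(
    Environment.GetEnvironmentVariable("AzureWebJobsStorage"),
    "myhubname-workitems");

// PeekMessagesAsync does not dequeue, so it does not affect visibility or DequeueCount.
PeekedMessage[] messages = await queueClient.PeekMessagesAsync(maxMessages: 32);
foreach (PeekedMessage msg in messages)
{
    Console.WriteLine($"{msg.MessageId}: DequeueCount={msg.DequeueCount}, Inserted={msg.InsertedOn}");
}
```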

davidmrdavid commented 1 month ago

Hi @well0549: I have yet to take a detailed look, but I want to follow up on this comment first: https://github.com/Azure/azure-functions-durable-extension/issues/2813#issuecomment-2126344385

You mentioned that you have no reference to any DurableTask packages, but in that screenshot, inside the SNFC folder, I do see a DurableTask package (version 1.2.0-rc.1). That appears to be the Durable Functions/Durable Task worker extension package, which is exactly what I wanted to see :). As you can see from the version suffix (the 'rc'), you're using a preview package, not the GA package (which is 1.1.3). Are you able to reproduce this issue with package version 1.1.3?

well0549 commented 1 month ago

OK, I downgraded to 1.1.3, cleaned up all the task hubs, and removed the storage tables and queues. What struck me is that with the preview version I would end up with 3 tables in the storage account.

(screenshot)

The last one is a table that I don't get with 1.1.3.

well0549 commented 1 month ago

I did several test runs with 1.1.3. Performance is more stable.

I still see some 5-minute pauses, but I think this is on par with .NET 6 non-isolated. (I think it now only happens when the instances scale down.)

The performance is much better. The reason to go to the preview was to clean up the orchestration history, but that works with 1.1.3 as well.
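For context, a minimal sketch of the kind of history cleanup meant here, assuming the PurgeAllInstancesAsync / PurgeInstancesFilter APIs on DurableTaskClient; the timer schedule and the 7-day retention window are hypothetical choices:

```csharp
// Hedged sketch of orchestration-history cleanup in the isolated worker,
// assuming the purge APIs on DurableTaskClient; the schedule and the
// 7-day retention window are hypothetical choices.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public class HistoryCleanup
{
    [Function(nameof(PurgeOldInstances))]
    public static async Task PurgeOldInstances(
        [TimerTrigger("0 0 3 * * *")] TimerInfo timer,   // daily at 03:00 UTC
        [DurableClient] DurableTaskClient client)
    {
        // Purge terminal instances that finished more than 7 days ago.
        var filter = new PurgeInstancesFilter(
            null,
            DateTime.UtcNow.AddDays(-7),
            new[]
            {
                OrchestrationRuntimeStatus.Completed,
                OrchestrationRuntimeStatus.Failed,
                OrchestrationRuntimeStatus.Terminated,
            });

        // The returned result reports how many instances were purged.
        var result = await client.PurgeAllInstancesAsync(filter);
    }
}
```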

well0549 commented 1 month ago

I guess this is no longer relevant (as long as the preview is not released).

well0549 commented 1 month ago

So I closed the issue.

nytian commented 3 weeks ago

Hey @well0549, I am working on verifying the performance of the .NET isolated worker with package 1.2.0-rc.1. I queried with the instance ID 8e6ada22112f43108bc079f7e752dbfd and the storage account eig-eigdemo-cleanup you provided, but I only found a few records, which is not enough to investigate. I guess you might have deleted the history from your storage account, so I wonder whether you have some other examples that would help me identify the root cause? Thank you!

well0549 commented 3 weeks ago

Unfortunately I have to clean up a lot because my Visual Studio subscription fills up with all the storage costs (and all other costs), so I regularly delete everything and create it anew. Since I went back to the released (non-preview) package, everything works a lot more smoothly. I am currently running Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.1.4, and besides the occasional 5-minute hiccup, I don't see the behaviour I saw on the RC anymore. I am willing to switch back to the RC, but then I need to know when you will investigate, because of all the related costs I clean up the test resources regularly.