Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

Stuck orchestrations at random on control-queue #903

Open sainipankaj90k opened 1 year ago

sainipankaj90k commented 1 year ago

Hi, we are facing an issue at random, not very frequently, roughly once every 15 days. All of a sudden, some control queue gets stuck and the orchestrations queued on it don't get processed until we restart the node itself. This does not happen around a service shutdown or anything similar. Also, the lease ownership status for the affected control queue shows success during the impacted period.

No predictable repro so far, but it happens quite consistently at the frequency mentioned above.

Thanks, Pankaj

davidmrdavid commented 1 year ago

Thanks @sainipankaj90k. Just noting here for visibility that we discussed internally to work together to create a private release with extra logs to help get to the bottom of this issue. We can keep this issue open while the investigation is active.

leskil commented 1 year ago

Hi @davidmrdavid - wondering if you have any updates here as we are experiencing similar issues?

sainipankaj90k commented 1 year ago

So far, I have tried to monitor the situation on the box and found that sometimes the C# awaited tasks get stuck. (Sounds very weird, but I experienced it first hand.)

I realized it gets stuck at various calls, e.g., fetching the app lease from storage and stopping/starting the task hub worker. [Through our local testing it's not predictable, but we have reproduced it.]

In DTF, it is usually found stuck in GetMessagesAsync in ControlQueue.cs, which also awaits C# tasks.
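A small watchdog like the sketch below can confirm that a specific awaited call is the one that never completes (purely illustrative; the helper name and timeout handling are made up for this example and are not DTF code or our actual code):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative sketch only: wraps an awaited call with a timeout so that a call
// which never completes surfaces in the logs instead of silently stalling the loop.
static class AwaitWatchdog
{
    public static async Task<T> WithHangDetection<T>(
        Task<T> task, TimeSpan timeout, string operationName)
    {
        // Task.WhenAny completes as soon as either the real task or the delay finishes.
        Task completed = await Task.WhenAny(task, Task.Delay(timeout));
        if (completed != task)
        {
            // Still pending after the timeout: log it, then keep waiting so the
            // original behavior is unchanged apart from the diagnostic.
            Console.WriteLine($"[watchdog] {operationName} still pending after {timeout}.");
        }

        return await task;
    }
}
```

The idea is just to wrap the suspect awaited call without changing its result, so a hang shows up in logs with the operation name attached.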

My hypothesis so far is that these stuck C# awaited tasks are causing the problem. Based on the ConfigureAwait documentation (https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.configureawait?view=net-6.0), I think we should use ConfigureAwait(false) to reduce the probability of stuck tasks.
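For context, a minimal sketch of what ConfigureAwait(false) changes (illustrative library-style code, not taken from DTF):

```csharp
using System.IO;
using System.Threading.Tasks;

// Illustrative only: ConfigureAwait(false) tells the awaiter not to marshal the
// continuation back onto the captured SynchronizationContext.
public static class PayloadReader
{
    public static async Task<string> ReadPayloadAsync(Stream stream)
    {
        using var reader = new StreamReader(stream);

        // Without ConfigureAwait(false), the continuation is posted back to the
        // captured context (if there is one) and can deadlock if that context is
        // blocked. With ConfigureAwait(false), the continuation runs on a thread
        // pool thread and does not depend on the captured context.
        return await reader.ReadToEndAsync().ConfigureAwait(false);
    }
}
```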

Right now, I am planning to use ConfigureAwait(false) in our code, and if that works out, we should have it used in this DTF library as well.

cgillum commented 1 year ago

> Right now, I am planning to use ConfigureAwait(false) in our code, and if that works out, we should have it used in this DTF library as well.

Please don't do this. It is likely to cause your orchestration to get stuck 100% of the time since orchestration code MUST always run in the orchestration's synchronization context.
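To make the constraint concrete, here is a sketch using DurableTask.Core's TaskOrchestration and TaskActivity base classes (the orchestration and activity here are hypothetical, made up for this example):

```csharp
using System.Threading.Tasks;
using DurableTask.Core;

// Hypothetical orchestration for illustration only.
public class GreetingOrchestration : TaskOrchestration<string, string>
{
    public override async Task<string> RunTask(OrchestrationContext context, string name)
    {
        // This await must NOT use ConfigureAwait(false): the continuation has to
        // resume on the orchestration's own synchronization context so the
        // framework can replay the code deterministically on a single logical thread.
        string greeting = await context.ScheduleTask<string>(typeof(SayHelloActivity), name);
        return greeting;
    }
}

// Hypothetical activity. Ordinary .NET code like this, outside the orchestration's
// replayed code path, is where ConfigureAwait(false) is generally reasonable.
public class SayHelloActivity : TaskActivity<string, string>
{
    protected override string Execute(TaskContext context, string input)
    {
        return $"Hello, {input}!";
    }
}
```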

The problem you're experiencing sounds more like an issue with the DurableTask.AzureStorage partition manager. Which version of the Microsoft.Azure.DurableTask.AzureStorage nuget package are you using?

sainipankaj90k commented 1 year ago

Interesting. For us, it is Microsoft.Azure.DurableTask.AzureStorage version 1.13.6.

leskil commented 1 year ago

We are on Microsoft.Azure.DurableTask.AzureStorage 1.12.0

cgillum commented 1 year ago

Okay, and just to confirm, are you still encountering this problem periodically with the latest version(s)?

sainipankaj90k commented 1 year ago

Yes, we are encountering these issues with the latest versions too.

Another observation: we have multiple task hub workers (think of them as microservices) on each node. We've found that when control queues get stuck, multiple control queues on the same node get stuck at the same time.

It could be an issue with awaiting threads causing a deadlock. So far I see no other explanation for an async request getting stuck, since there is no logic there that should keep it on hold.
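For reference, a sketch of the classic sync-over-async shape that produces this kind of hang (illustrative code only, not taken from DTF or our service):

```csharp
using System.Threading.Tasks;

// Illustrative only: the classic sync-over-async pattern that can deadlock or
// starve the thread pool. If many callers block threads like this while the
// awaited work needs a free thread (or a captured context) to complete,
// every awaited call in the process can appear stuck at once.
public class QueuePoller
{
    public string PollOnce()
    {
        // Blocking here ties up a thread until FetchMessageAsync finishes...
        return FetchMessageAsync().GetAwaiter().GetResult();
    }

    private async Task<string> FetchMessageAsync()
    {
        // ...but the continuation after this await also needs a thread (or the
        // captured context) to run, so under heavy load or a blocked context
        // the two can end up waiting on each other indefinitely.
        await Task.Delay(100);
        return "message";
    }
}
```

If something in the process blocks threads like this, all task hub workers in that process share the same thread pool, which could line up with multiple control queues on the same node getting stuck together.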