sainipankaj90k opened 1 year ago
Thanks @sainipankaj90k. Just noting here for visibility: we discussed internally and agreed to work together on a private release with extra logs to help get to the bottom of this issue. We can keep this issue open while the investigation is active.
Hi @davidmrdavid - wondering if you have any updates here as we are experiencing similar issues?
So far, I have been monitoring the situation on the box and found that the C# awaited tasks sometimes get stuck. (It sounds very weird, but we experienced it first hand.)
We realized it gets stuck at various calls, e.g., fetching the app lease from storage and stopping/starting the task hub worker. [In our local testing it is not predictable, but we have reproduced it.]
In DTF, it is usually found stuck in 'GetMessagesAsync' in 'ControlQueue.cs', which also awaits C# tasks.
Our hypothesis so far is that these stuck C# awaited tasks are the cause. Per the ConfigureAwait documentation (https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.configureawait?view=net-6.0), I think we should use 'ConfigureAwait(false)' to reduce the probability of stuck tasks.
Right now, I am planning to use ConfigureAwait(false) in our code, and if that works, we should have it used in this DTF library as well.
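For context, here is a minimal sketch of what ConfigureAwait(false) changes. The helper below is hypothetical (not a DTF API): with `false`, the code after the `await` does not resume on the captured SynchronizationContext but on a thread-pool thread, which is the usual convention for general-purpose library code that does not touch context-bound state.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class StorageHelpers
{
    // Hypothetical helper, not part of DTF. With ConfigureAwait(false),
    // the continuation after the await runs on a thread-pool thread
    // instead of the caller's captured SynchronizationContext.
    public static async Task<byte[]> ReadAllBytesAsync(Stream source)
    {
        using var buffer = new MemoryStream();
        await source.CopyToAsync(buffer).ConfigureAwait(false);
        // From here on we may be on any thread-pool thread.
        return buffer.ToArray();
    }
}
```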
> Right now, I am planning to use ConfigureAwait(false) in our code, and if that works, we should have it used in this DTF library as well.
Please don't do this. It is likely to cause your orchestrations to get stuck 100% of the time, since orchestration code MUST always run on the orchestration's synchronization context.
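To illustrate the warning above: Durable Task orchestrations run on a framework-owned SynchronizationContext that serializes continuations so replay stays deterministic. `TaskOrchestration` and `OrchestrationContext.ScheduleTask` are real DTF types, but the orchestration body below is a simplified, hypothetical sketch, not code from this repository.

```csharp
using System.Threading.Tasks;
using DurableTask.Core;

// Illustrative sketch only; activity names are made up.
public class TransferOrchestration : TaskOrchestration<bool, string>
{
    public override async Task<bool> RunTask(OrchestrationContext context, string input)
    {
        // Correct: the await resumes on the orchestration's own
        // synchronization context, keeping replay deterministic.
        await context.ScheduleTask<bool>("DebitAccount", "1.0", input);

        // WRONG: ConfigureAwait(false) would resume this method on a
        // thread-pool thread, outside the orchestration's context, so the
        // dispatcher loses track of it and the orchestration appears stuck:
        // await context.ScheduleTask<bool>("CreditAccount", "1.0", input)
        //     .ConfigureAwait(false);

        return true;
    }
}
```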
The problem you're experiencing sounds more like an issue with the DurableTask.AzureStorage partition manager. Which version of the Microsoft.Azure.DurableTask.AzureStorage NuGet package are you using?
Interesting. For us, it is Microsoft.Azure.DurableTask.AzureStorage version 1.13.6.
We are on Microsoft.Azure.DurableTask.AzureStorage 1.12.0
Okay, and just to confirm, are you still encountering this problem periodically with the latest version(s)?
Yes, we are encountering these issues with the latest versions too.
Another observation: we have multiple task hub workers (think of them like microservices) on each node. We've found that when control queues get stuck, multiple control queues on the same node get stuck together.
It could be an awaiting-thread issue causing a deadlock. I see no other reason for an async request to get stuck when there is no logic that could keep it on hold.
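The deadlock being suspected here has a well-known sync-over-async shape; a minimal sketch with hypothetical method names (this is the general pattern, not code found in DTF):

```csharp
using System.Threading.Tasks;

public class LeaseClient
{
    // A thread bound to a single-threaded SynchronizationContext
    // (UI thread, classic ASP.NET, or a framework's orchestration
    // context) blocks here, holding that context...
    public string GetLeaseBlocking()
    {
        return FetchLeaseAsync().Result;
    }

    private async Task<string> FetchLeaseAsync()
    {
        await Task.Delay(100); // captures the caller's context
        // ...but this continuation is queued to that same context,
        // which the blocked thread never releases -> deadlock.
        // ConfigureAwait(false) on the await above (in library code)
        // breaks the cycle by resuming on the thread pool instead.
        return "lease";
    }
}
```

On a plain console app this code completes, because console apps have no SynchronizationContext; the deadlock appears only when a context-bound thread calls `GetLeaseBlocking()`.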
Hi, we are facing these issues at random; not very frequently, but roughly once every 15 days. All of a sudden, some control queue gets stuck, and the orchestrations queued on it are not processed until we restart the node itself. This does not happen near a service shutdown or the like. Also, the lease ownership status shows success for that control queue throughout the impacted period.
No predictable repro so far, but it recurs quite consistently at the frequency mentioned above.
Thanks, Pankaj