Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

context.CreateTimer delaying more than expected #833

Open yell0wfl4sh opened 1 year ago

yell0wfl4sh commented 1 year ago

context.CreateTimer is firing up to 2 hours later than expected in a few executions. In what scenarios can this happen?

[Image: time chart of the delta between expected and actual time, in minutes]

cgillum commented 1 year ago

context.CreateTimer adds a message to the appropriate orchestrator queue that becomes visible at the specified time. If you're seeing a delay, it means there is some problem preventing the message from being processed at the expected time. Your best bet for debugging is to look through the logs to see what might be going on. https://github.com/Azure/durabletask/tree/main/src/DurableTask.Core/Logging#readme
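To make the mechanism above concrete, here is a toy model in Python of a queue whose messages stay hidden until a given time. It is only a sketch: `ToyTimerQueue`, `schedule`, and `poll` are hypothetical names invented for illustration, not part of the Durable Task Framework or Azure Storage. It shows that the message itself becomes visible on time, and that any observed lateness comes from when a worker actually gets around to processing it.

```python
import heapq
from datetime import datetime, timedelta


class ToyTimerQueue:
    """Toy model (hypothetical): messages stay hidden until `visible_at`."""

    def __init__(self):
        self._heap = []  # entries are (visible_at, sequence, payload)
        self._seq = 0    # tie-breaker so payloads are never compared

    def schedule(self, visible_at, payload):
        """Enqueue a message that becomes visible at `visible_at`."""
        heapq.heappush(self._heap, (visible_at, self._seq, payload))
        self._seq += 1

    def poll(self, now):
        """Return (payload, lateness) for every message visible at `now`."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            visible_at, _, payload = heapq.heappop(self._heap)
            fired.append((payload, now - visible_at))
        return fired


start = datetime(2024, 1, 1, 12, 0, 0)
queue = ToyTimerQueue()
queue.schedule(start + timedelta(minutes=30), "timer-A")

# Assumption for illustration: the worker is busy or unhealthy and only
# polls 90 minutes after start, so the timer is handled 60 minutes late.
fired = queue.poll(start + timedelta(minutes=90))
payload, lateness = fired[0]
```

In this model the lateness is entirely the gap between the visibility time and the poll time, which is why the logs of the worker are the place to look when a timer appears delayed.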

yell0wfl4sh commented 1 year ago

Hi @cgillum,

Could you please help me in comprehending the potential overhead associated with a large number of orchestrations in a sleep state that utilize context.CreateTimer? Is there a recommended maximum time duration for employing context.CreateTimer? Additionally, is there a limit to the number of orchestrations that can be put to sleep using context.CreateTimer?

We are contemplating the implementation outlined in this GitHub issue in our system: https://github.com/Azure/azure-functions-durable-extension/issues/2287. Essentially, our aim is to avoid non-deterministic exceptions in long-running orchestrations. We would also like to reduce any CPU overhead that might arise from prolonged use of context.CreateTimer.

Our current strategy involves replicating the internal functionality of context.CreateTimer, which involves queuing a message that becomes visible at a specified time. However, if there are no discernible advantages over utilizing the existing infrastructure of context.CreateTimer, we are inclined to create a new orchestrator that will sleep until the designated time, subsequently triggering a callback or initiating a new orchestrator process.

Thank you for your guidance and insights.

cgillum commented 1 year ago

> Could you please help me in comprehending the potential overhead associated with a large number of orchestrations in a sleep state that utilize context.CreateTimer?

It depends on which storage backend you're using. If you're using the Azure Storage backend (which is the most common one), then there's no overhead. When waiting for a timer, the orchestration can be completely unloaded from memory and there's zero (additional) polling that needs to be done for each new timer. With a backend like MSSQL, however, there is some overhead because there are more rows in the database which can impact the performance of scan operations when looking for new work.

> Is there a recommended maximum time duration for employing context.CreateTimer?

There's no enforced maximum for how long a timer can be. However, Azure Storage doesn't allow messages to sit in the queue for longer than 7 days, so we have to work around this by scheduling multiple timers in sequence if you need to schedule a long timer. For example, if you schedule a timer for 10 days, we'll actually schedule a timer for just 3 days, schedule another timer for 3 days after the first one fires, etc., until we reach 10 days of total waiting. There is some overhead for this because each time one of these sub-timers fires, the orchestration must be loaded so that the timer can be processed and a new timer scheduled.
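The arithmetic of the 10-day example can be sketched in a few lines of Python. This is an illustration only: `split_timer` is a hypothetical helper, and the 3-day segment length is taken from the example in the comment above, not from the framework's source, so the real segment size and logic may differ.

```python
from datetime import timedelta


def split_timer(total, max_segment=timedelta(days=3)):
    """Split a long wait into sequential sub-timers of at most `max_segment`.

    Hypothetical helper; the 3-day default mirrors the example in the
    thread and is an assumption, not a documented framework constant.
    """
    if max_segment <= timedelta(0):
        raise ValueError("max_segment must be positive")
    segments = []
    remaining = total
    while remaining > timedelta(0):
        step = min(remaining, max_segment)  # last segment may be shorter
        segments.append(step)
        remaining -= step
    return segments


# A 10-day timer becomes sub-timers of 3, 3, 3, and 1 days.
segments = split_timer(timedelta(days=10))
```

Each element of `segments` corresponds to one point where the orchestration would have to be loaded again to schedule the next sub-timer, which is the overhead described above.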