Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

Functions timeouts not handled as they are in regular Azure Functions, leading to function host restarts and timers going missing #1114

Open ericleigh007 opened 2 weeks ago

ericleigh007 commented 2 weeks ago

Cross-posted from webjobs-extensions.

We have a timer trigger that is supposed to run every 4 minutes. For some reason, it failed to fire for a period of 32 minutes and then resumed. Unfortunately for us, this timer trigger handles a critical batching process, and we cannot afford such pauses in operation.

Repro steps: Not sure this is easily reproducible, but I'll list the setup.


Create a timer trigger function that runs every 4 minutes.

Run the timer trigger function in a function app scaled out to 10 instances.

Expected behavior: The timer trigger function should execute every 4 minutes, without gaps.

Actual behavior: At 2024-05-23T14:06:00Z in our instance, the timer trigger function stops triggering on this instance and "pauses". At 2024-05-23T14:38:00Z, the timer trigger starts triggering on another instance. It seems that the trigger moves between different instances roughly every 30 minutes. For some reason, this 30-odd minute gap appears where no instance (I'm guessing) acquires the lease, so none executes it.
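To make the size of the gap concrete, here is a minimal sketch that scans a list of execution timestamps for missed runs. It is only an illustration: the `find_gaps` and `utc` helpers and the sample log are hypothetical, and it assumes the 14:06 run was the last successful execution before the pause.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=4)  # schedule stated in the report


def find_gaps(fired_at, interval=INTERVAL):
    """Return (previous, next, missed_runs) for each pair of consecutive
    executions that are further apart than one schedule interval."""
    ordered = sorted(fired_at)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        delta = nxt - prev
        if delta > interval:
            # Whole intervals skipped between the two runs, rounded down.
            missed = delta // interval - 1
            gaps.append((prev, nxt, missed))
    return gaps


def utc(hour, minute):
    """Hypothetical helper: a UTC timestamp on the day of the incident."""
    return datetime(2024, 5, 23, hour, minute, tzinfo=timezone.utc)


# Hypothetical execution log. Only 14:06 and 14:38 come from the report;
# the surrounding entries are assumed for illustration.
fired = [utc(13, 58), utc(14, 2), utc(14, 6), utc(14, 38), utc(14, 42)]

for prev, nxt, missed in find_gaps(fired):
    print(f"gap of {nxt - prev} after {prev.isoformat()}: {missed} runs missed")
```

With this sample log the scan reports a single 32-minute gap, which on a 4-minute schedule corresponds to 7 missed executions (14:10 through 14:34).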


Known workarounds: None for the timeouts causing host restarts; the ultimate workaround is to set the Timeout attribute to prevent the host from restarting.

Related information:

Microsoft.Net.Sdk.Functions 4.2.0
Microsoft.Azure.Functions.Extensions 1.1.0
Microsoft.Azure.WebJobs.Extensions.StorageBlobs 5.0.1
Runtime version ~4, 64-bit, .NET 6

Our function app is a very complex one that uses change feeds from Cosmos DB in an event-driven pattern. Items are processed through change feed events from input to output, ending up mainly as file blobs in blob storage containers.

Timer triggers provide a batching mechanism for output files to be grouped to the required size for downstream systems.

Repro notes: I have NOT been able to reproduce this when the Durable Functions extension is not in use. I have a repro with two timers: one with Durable Functions, the other without. There is some evidence, though, that the Trigger attribute is not interpreted as it is in "normal" functions: "ThrowOnTimeout" seems to be TRUE unless overridden, unlike in normal functions, where it is documented, and verified, to be FALSE.

I can provide some detailed tracing information in DMs.

Any thoughts, @davidmrdavid ?