Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.

[DurableTask-AzureStorage] Eternal Orchestration Stuck and Consistently Abandoning the Message #1019

Open ykhazbak opened 6 months ago

ykhazbak commented 6 months ago

The eternal orchestration "SiteNetworkServiceStateBillingOrchestrator" started executing and then got stuck while processing a message after a lease reassignment.

The partition "ansmsitenetworkservicehub-control-06" was reassigned from worker node "_armBEaz_10" to worker node "_armBEaz_11". Immediately after the lease reassignment, worker node "_armBEaz_11" was unable to process one message (a TimerFired event) and has been consistently abandoning that message for days.

The orchestration is stuck at line 114 of the code below. Note that four task activities had already executed at this point: image

Logs: https://jarvis-int-west.microsoftgeneva.com/E06F8A5F https://jarvis-int-west.microsoftgeneva.com/8D6D9236

Instance Id: 613e83a4-eb15-42c6-aa12-329f0e215894:SiteNetworkServiceStateBillingOrchestrator:V1

Event Type: TimerFired

Can someone help identify whether this is a race condition, and if so, how we can resolve it? This is a billing orchestration that runs periodically, so it is very important that it runs smoothly and consistently emits billing events.