Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.52k stars 295 forks source link

Recurrent task with expiration > 7 days fails but continually retries without any indication to the user #1006

Open papadeltasierra opened 1 year ago

papadeltasierra commented 1 year ago

We wrote a recurring orchestrator, recurring every 15 minutes, and wanted it to run for a long time. We set the expiration to 15days and this seemed to work but... We later discovered that the first iteration of the orchestrator happened but subsequent iterations did not and there was no indication of an error back to our code and no obvious logs. We were eventually able to find that indicated that the DTF failing to run the orchestration and then retrying, and DTF continued to do this every 10 minutes, forever!

Having debugged this (against tagged release durabletask.azurestorage-v1.13.6) we have traced the issue to this Exception.

System.ArgumentOutOfRangeException: The argument 'initialVisibilityDelay' is larger than maximum of '7.00:00:00' (Parameter 'initialVisibilityDelay')
   at Microsoft.WindowsAzure.Storage.Core.Util.CommonUtility.AssertInBounds[T](String paramName, T val, T min, T max)
   at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.AddMessageImpl(CloudQueueMessage message, Nullable`1 timeToLive, Nullable`1 initialVisibilityDelay, QueueRequestOptions options)
   at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.AddMessageAsync(CloudQueueMessage message, Nullable`1 timeToLive, Nullable`1 initialVisibilityDelay, QueueRequestOptions options, OperationContext operationContext, CancellationToken cancellationToken)
   at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.AddMessageAsync(CloudQueueMessage message, Nullable`1 timeToLive, Nullable`1 initialVisibilityDelay, QueueRequestOptions options, OperationContext operationContext)
   at DurableTask.AzureStorage.Storage.Queue.<>c__DisplayClass12_0.<AddMessageAsync>b__0(OperationContext context, CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Storage/Queue.cs:line 55
   at DurableTask.AzureStorage.Storage.AzureStorageClient.WrapFunctionWithReturnType(Func`3 storageRequest, OperationContext context, CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 159
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId) in /_/src/DurableTask.AzureStorage/TimeoutHandler.cs:line 133
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 150
   at DurableTask.AzureStorage.Storage.Queue.AddMessageAsync(QueueMessage queueMessage, Nullable`1 visibilityDelay, Nullable`1 clientRequestId) in /_/src/DurableTask.AzureStorage/Storage/Queue.cs:line 54
   at DurableTask.AzureStorage.Messaging.TaskHubQueue.AddMessageAsync(TaskMessage taskMessage, OrchestrationInstance sourceInstance, SessionBase session) in /_/src/DurableTask.AzureStorage/Messaging/TaskHubQueue.cs:line 129
   at DurableTask.AzureStorage.Utils.InvokeThrottledAction[T](T item, Func`2 action, SemaphoreSlim semaphore) in /_/src/DurableTask.AzureStorage/Utils.cs:line 81
   at DurableTask.AzureStorage.Utils.ParallelForEachAsync[T](IList`1 items, Int32 maxConcurrency, Func`2 action) in /_/src/DurableTask.AzureStorage/Utils.cs:line 72
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CommitOutboundQueueMessages(OrchestrationSession session, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1251
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newOrchestrationRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1037
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 578
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemSessionAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 194
   at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 373 | formatter=TaskOrchestrationDispatcher-da755bb068694a5d8c860100a5800fdd-0: Unhandled exception with work item 'f810cf8a-5e75-404d-95db-f84a07881dc3:DummyTaskOrchestration:V1': System.ArgumentOutOfRangeException: The argument 'initialVisibilityDelay' is larger than maximum of '7.00:00:00' (Parameter 'initialVisibilityDelay')
   at Microsoft.WindowsAzure.Storage.Core.Util.CommonUtility.AssertInBounds[T](String paramName, T val, T min, T max)
   at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.AddMessageImpl(CloudQueueMessage message, Nullable`1 timeToLive, Nullable`1 initialVisibilityDelay, QueueRequestOptions options)
   at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.AddMessageAsync(CloudQueueMessage message, Nullable`1 timeToLive, Nullable`1 initialVisibilityDelay, QueueRequestOptions options, OperationContext operationContext, CancellationToken cancellationToken)
   at Microsoft.WindowsAzure.Storage.Queue.CloudQueue.AddMessageAsync(CloudQueueMessage message, Nullable`1 timeToLive, Nullable`1 initialVisibilityDelay, QueueRequestOptions options, OperationContext operationContext)
   at DurableTask.AzureStorage.Storage.Queue.<>c__DisplayClass12_0.<AddMessageAsync>b__0(OperationContext context, CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Storage/Queue.cs:line 55
   at DurableTask.AzureStorage.Storage.AzureStorageClient.WrapFunctionWithReturnType(Func`3 storageRequest, OperationContext context, CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 159
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId) in /_/src/DurableTask.AzureStorage/TimeoutHandler.cs:line 133
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 150
   at DurableTask.AzureStorage.Storage.Queue.AddMessageAsync(QueueMessage queueMessage, Nullable`1 visibilityDelay, Nullable`1 clientRequestId) in /_/src/DurableTask.AzureStorage/Storage/Queue.cs:line 54
   at DurableTask.AzureStorage.Messaging.TaskHubQueue.AddMessageAsync(TaskMessage taskMessage, OrchestrationInstance sourceInstance, SessionBase session) in /_/src/DurableTask.AzureStorage/Messaging/TaskHubQueue.cs:line 129
   at DurableTask.AzureStorage.Utils.InvokeThrottledAction[T](T item, Func`2 action, SemaphoreSlim semaphore) in /_/src/DurableTask.AzureStorage/Utils.cs:line 81
   at DurableTask.AzureStorage.Utils.ParallelForEachAsync[T](IList`1 items, Int32 maxConcurrency, Func`2 action) in /_/src/DurableTask.AzureStorage/Utils.cs:line 72
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CommitOutboundQueueMessages(OrchestrationSession session, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1251
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newOrchestrationRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1037
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 578
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemSessionAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 194
   at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 373

The call to CloudQueue.AddMessageAsync Method, https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.queue.cloudqueue.addmessageasync?view=azure-dotnet-legacy#definition, seems to have an undocumented limit on the initialVisibilityDelay of 7 days. Setting <=7 days works, setting anything more fails with the error above.

papadeltasierra commented 1 year ago

We traced this back through the DTF code (current code seems similar to the tagged release that we were using) and can suggest at least three possible solutions.

  1. In src\DurableTask.AzureStorage\Storage\AzureStorageClient.cs, the code already catches StorageException but does no special processing for ArgumentOutOfRangeException. It is assumed that the DurableTaskStorageException would flow back to the user so perhaps the catch statement could be extended. However it would also be useful if the error could map from the Azure storage information (which the user might know little about) to the actual DTF field that the user has set incorrectly and needs to change.
  2. Allow the ArgumentOutOfRangeException to flow back up the stack but now consider such exceptions as FATAL (ifFatal in src\DurableTask.Core\Common\Utils.cs) so that src\DurableTask.Core\WorkItemDispatcher.cs, line 405, will not just keep retrying the orchestration/task. The issue with this is that the ArgumentOutOfRangeException returned to the user will again indicate a field that the user knows nothing about (initialVisibilityDelay).
  3. Implement logic that handles long expiration times of >7 days by submitting expirations of 7 days and then repeating as the 7-day expiration occurs until the required long expiration has completed.
steveland83 commented 2 months ago

Adding additional context to this bug. We encountered the same problem, where we were scheduling an activity to run 14 days later, and similarly did not see any indication of an error - however over a few weeks our system's performance degraded dramatically (from around a second per orchestration to sometimes more than an hour).

What may be different for us is that we are using durable entities, and interactions with these entities have become shockingly slow. The log below timed an entity operation (in milliseconds). image

We have noted in our investigation that although there are 158 queued messages in the control-worker-01 queue (see image 1 below) the worker process does not appear to pick them up (image 2, for the same period). While we understand that many of these may be queued with a visibility delay (despite the errors) we would expect that should not impact the processing of messages which are queued without any visibility delay - however it appears that they are impacted.

Also worth noting that we only encountered the problem when we migrated to an isolated function running DotNet 8 - previously we ran in-process functions on DotNet 6 and there was no problem (over 2 years).

  1. Storage account showing 158 queued messages, some of which are visible (note that the messages displayed are erratic): image

  2. control worker logs showing no new messages found in the same period (NB. valid jobs were queued) image

  3. exception message shows that each time this exception occurs the worker waits 10s before continuing - which will compound the problem as the messages grow. image

  4. Example of message content - probably not relevant image

  5. Overall workflow trace showing the resultant delay (around 15m for this case) image