Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License
711 stars 263 forks source link

"DurableTask.AzureStorage.Storage.DurableTaskStorageException: Element X in the batch returned an unexpected response code" Exception while running multiple SubOrchestrators #2756

Open KonradCzarny opened 3 months ago

KonradCzarny commented 3 months ago

We have a process that creates multiple parallel SubOrchestrators - which consist of multiple parallel activities each. This functionality is scaled out across multiple workers, each with 4 vCPU. To give you some numbers:

The problem is we often get this particular error:

TaskOrchestrationDispatcher-b3d05808e92646ee98aeba4a639f3911-0: Unhandled exception with work item '546670575e414013a3c2ffcb94b2b811:22': DurableTask.AzureStorage.Storage.DurableTaskStorageException: Element 7 in the batch returned an unexpected response code.
 ---> Microsoft.WindowsAzure.Storage.StorageException: Element 7 in the batch returned an unexpected response code.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId)
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force)
Request Information
RequestID:9180912d-e002-002f-231c-70b493000000
RequestDate:Wed, 06 Mar 2024 23:19:31 GMT
StatusMessage:7:The update condition specified in the request was not satisfied.
ErrorCode:
ErrorMessage:7:The update condition specified in the request was not satisfied.
RequestId:9180912d-e002-002f-231c-70b493000000
Time:2024-03-06T23:19:31.9715400Z

   --- End of inner exception stack trace ---
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 141
   at DurableTask.AzureStorage.Storage.Table.ExecuteBatchAsync(TableBatchOperation batchOperation, String batchType) in /_/src/DurableTask.AzureStorage/Storage/Table.cs:line 172
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UploadHistoryBatch(String instanceId, String sanitizedInstanceId, String executionId, TableBatchOperation historyEventBatch, StringBuilder historyEventNamesBuffer, Int32 numberOfTotalEvents, Int32 episodeNumber, Int32 estimatedBatchSizeInBytes, String eTagValue, Boolean isFinalBatch) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1323
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UpdateStateAsync(OrchestrationRuntimeState newRuntimeState, OrchestrationRuntimeState oldRuntimeState, String instanceId, String executionId, String eTagValue, Object trackingStoreContext) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1098
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newOrchestrationRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1055
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 578
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemSessionAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 194
   at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 373

Backing off for 10 seconds until 5 successful operations

Instance Id mentioned in it, points to either SubOrchestrator or main orchestrator - depending on instance of an error.

While investigating this issue, I've found similar threads, describing it as Split-brain error - and explaining situation as 2 workers fighting for one resource - from which only one worker wins, and the other one reports an error like this. This would make sense - since while checking particular InvocationId inside Durable Functions Monitor - it shows orchestration instance as completed successfully.

The same explanations mentioned that it is something that we can hardly mitigate ourselves, and system should automatically recover from these. So my questions are:

1) What could be the potential fix for that, if any 2) If it will be impossible to fix that - is there any way to Suppress this error from showing in Logs? Since it is something internal, I can hardly find a way to do it.

leviettung200 commented 1 month ago

@bachuv Please investigate this issue; we have encountered an exception in production. How can I expedite the processing?

TaskOrchestrationDispatcher-c61e8572448f4a0eac5cd5158bd02ef0-0: Unhandled exception with work item 'ce6e21055abe493abb66e33d69d5c22d': DurableTask.AzureStorage.Storage.DurableTaskStorageException: Element 4 in the batch returned an unexpected response code.
 ---> Microsoft.WindowsAzure.Storage.StorageException: Element 4 in the batch returned an unexpected response code.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId)
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force)
Request Information
RequestID:ce9599e6-7002-004a-1704-adab49000000
RequestDate:Thu, 23 May 2024 11:32:06 GMT
StatusMessage:4:The specified entity already exists.
ErrorCode:
ErrorMessage:4:The specified entity already exists.
RequestId:ce9599e6-7002-004a-1704-adab49000000
Time:2024-05-23T11:32:06.8919651Z

   --- End of inner exception stack trace ---
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 140
   at DurableTask.AzureStorage.Storage.Table.ExecuteBatchAsync(TableBatchOperation batchOperation, String batchType) in /_/src/DurableTask.AzureStorage/Storage/Table.cs:line 172
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UploadHistoryBatch(String instanceId, String sanitizedInstanceId, String executionId, TableBatchOperation historyEventBatch, StringBuilder historyEventNamesBuffer, Int32 numberOfTotalEvents, Int32 episodeNumber, Int32 estimatedBatchSizeInBytes, String eTagValue, Boolean isFinalBatch) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1289
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UpdateStateAsync(OrchestrationRuntimeState newRuntimeState, OrchestrationRuntimeState oldRuntimeState, String instanceId, String executionId, String eTagValue) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1085
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newOrchestrationRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1055
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 551
   at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemSessionAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 194
   at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 373

Backing off for 10 seconds until 5 successful operations