KonradCzarny opened 3 months ago
@bachuv Please investigate this issue; we have encountered an exception in production. How can I expedite the processing?
```
TaskOrchestrationDispatcher-c61e8572448f4a0eac5cd5158bd02ef0-0: Unhandled exception with work item 'ce6e21055abe493abb66e33d69d5c22d': DurableTask.AzureStorage.Storage.DurableTaskStorageException: Element 4 in the batch returned an unexpected response code.
---> Microsoft.WindowsAzure.Storage.StorageException: Element 4 in the batch returned an unexpected response code.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId)
at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force)
Request Information
RequestID:ce9599e6-7002-004a-1704-adab49000000
RequestDate:Thu, 23 May 2024 11:32:06 GMT
StatusMessage:4:The specified entity already exists.
ErrorCode:
ErrorMessage:4:The specified entity already exists.
RequestId:ce9599e6-7002-004a-1704-adab49000000
Time:2024-05-23T11:32:06.8919651Z
--- End of inner exception stack trace ---
at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 140
at DurableTask.AzureStorage.Storage.Table.ExecuteBatchAsync(TableBatchOperation batchOperation, String batchType) in /_/src/DurableTask.AzureStorage/Storage/Table.cs:line 172
at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UploadHistoryBatch(String instanceId, String sanitizedInstanceId, String executionId, TableBatchOperation historyEventBatch, StringBuilder historyEventNamesBuffer, Int32 numberOfTotalEvents, Int32 episodeNumber, Int32 estimatedBatchSizeInBytes, String eTagValue, Boolean isFinalBatch) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1289
at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UpdateStateAsync(OrchestrationRuntimeState newRuntimeState, OrchestrationRuntimeState oldRuntimeState, String instanceId, String executionId, String eTagValue) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1085
at DurableTask.AzureStorage.AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newOrchestrationRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1055
at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 551
at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemSessionAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 194
at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 373
Backing off for 10 seconds until 5 successful operations
```
We have a process that creates multiple parallel SubOrchestrators, each of which consists of multiple parallel activities. This functionality is scaled out across multiple workers, each with 4 vCPUs. To give you some numbers:
The problem is that we often get the error shown above. The instance ID mentioned in it points to either a SubOrchestrator or the main orchestrator, depending on the occurrence of the error.
While investigating this issue, I found similar threads describing it as a "split-brain" error: two workers compete for the same resource, only one wins, and the other reports an error like this. This would make sense, because when I check the particular InvocationId in Durable Functions Monitor, the orchestration instance is shown as completed successfully.
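To make the split-brain explanation concrete, here is a minimal sketch (our own illustration in Python, not the DurableTask code) of the race: two workers try to insert the same history row, the table enforces uniqueness of the (partition key, row key) pair, so exactly one insert succeeds and the loser sees an "entity already exists" conflict even though the orchestration state ends up correct:

```python
# Illustrative sketch only: names and structure are hypothetical, not the
# actual DurableTask.AzureStorage implementation.

class EntityAlreadyExists(Exception):
    """Stands in for the 409 Conflict the table service returns."""

class HistoryTable:
    """A toy table that, like Azure Table Storage, rejects duplicate keys."""
    def __init__(self):
        self._rows = {}

    def insert(self, partition_key, row_key, payload):
        key = (partition_key, row_key)
        if key in self._rows:          # duplicate insert -> conflict
            raise EntityAlreadyExists(key)
        self._rows[key] = payload

def worker(table, instance_id, event_id):
    """Each worker races to record the same history event."""
    try:
        table.insert(instance_id, event_id, "ExecutionCompleted")
        return "won"
    except EntityAlreadyExists:
        # The other worker already recorded this event; the orchestration
        # history is still correct, so the loser's failure is benign.
        return "lost"

table = HistoryTable()
results = [worker(table, "instance-1", "0004") for _ in range(2)]
print(results)  # → ['won', 'lost']
```

The first insert wins and the second is rejected, which is why the orchestration can still show as completed successfully while one worker logs the exception.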
The same threads mentioned that there is little we can do to mitigate this ourselves, and that the system should recover from it automatically. So my questions are:
1) What could be the potential fix for this, if any?
2) If it cannot be fixed, is there any way to suppress this error in the logs? Since it is something internal, I can hardly find a way to do it.
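Regarding question 2, one option we are considering (a sketch based on the documented host.json logging configuration, not something we have verified against this specific error) is raising the log-level threshold for the DurableTask categories so these transient storage warnings no longer reach our logs:

```json
{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "DurableTask.AzureStorage": "Error",
      "DurableTask.Core": "Error"
    }
  }
}
```

Whether this filters the message depends on the level the error is emitted at; setting a category to "None" would silence it entirely, at the cost of losing other diagnostics from that category.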