Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.51k stars 290 forks source link

Enhancement: Easier correlation of Core Log events / Azure Storage back end #914

Open rafaeldnferreira opened 1 year ago

rafaeldnferreira commented 1 year ago

Scenario

I've had recently to debug delays on orchestration scheduling in our service, and by having to inspect for the first time in deeper details various events, I realized that service owners could benefit from additional data/columns on various existing Events that are generated as orchestrations/tasks are decomposed and processed. The below is a compiled list of suggestions to be added:

Durable Task Core

FetchingThrottled

Inclusion of the TaskHub name would make it easier to identify its origin, currently only the Dispatcher and the PID are available to be used as potential keys, which makes it not so intuitive to correlate with other logs.

FetchWorkItemCompleted

The only column available that gives any indication of what is being done is the Work Item ID, it would be easier if a Work item Description could give a human readable representation of what the particular work item is. Examples: The name of the TaskAcitivity, the Control flow (orchestration's name), Timer etc...

Azure Storage

ProcessWorkItemFailed

When a work item fails due a a batch operation failed, it would be very useful to identify which work item has caused the batch operation to fail, currently we see the exception details in a format like:

DurableTask.AzureStorage.Storage.DurableTaskStorageException: Element 4 in the batch returned an unexpected response code.
 ---> Microsoft.WindowsAzure.Storage.StorageException: Element 4 in the batch returned an unexpected response code.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId) in /_/src/DurableTask.AzureStorage/TimeoutHandler.cs:line 143
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 149
Request Information
RequestID:85b2717d-0002-0000-4246-9d7f66000000
RequestDate:Mon, 12 Jun 2023 15:55:21 GMT
StatusMessage:4:The specified entity already exists.
ErrorCode:
ErrorMessage:4:The specified entity already exists.
RequestId:85b2717d-0002-0000-4246-9d7f66000000
Time:2023-06-12T15:55:21.9700388Z

It isn't easy to track who was the work item at the position 4 in the example above, neither to conclude the root cause. Was it indeed a split brain/message being processed twice under the same node? An enriched version of this stack trace would contain relevant information that would allow the work item to be identified/element on the history table that has caused the issue.

Another sample:

DurableTask.AzureStorage.Storage.DurableTaskStorageException: Element 5 in the batch returned an unexpected response code.
 ---> Microsoft.WindowsAzure.Storage.StorageException: Element 5 in the batch returned an unexpected response code.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId) in /_/src/DurableTask.AzureStorage/TimeoutHandler.cs:line 143
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 149
Request Information
RequestID:2f21a709-0002-002c-0e48-9d1a4c000000
RequestDate:Mon, 12 Jun 2023 16:08:03 GMT
StatusMessage:5:The update condition specified in the request was not satisfied.
ErrorCode:
ErrorMessage:5:The update condition specified in the request was not satisfied.
RequestId:2f21a709-0002-002c-0e48-9d1a4c000000
Time:2023-06-12T16:08:03.0858288Z
mitchellrust commented 7 months ago

We're also experiencing difficulty in correlating this information, would be very useful to have that detail included.

georgeOsdDev commented 4 months ago

We are experiencing same error.