Open jviau opened 2 years ago
One thing customers have asked us for is metrics for Azure Storage, like queue length. It's obviously specific to the DurableTask.AzureStorage backend, but it would be useful if the different backends could add their own metrics as part of this work.
Core metrics:

Name | Instrument Type | Unit | Unit (UCUM) | Description |
---|---|---|---|---|
`durabletask.task.limit` | Async UpDownCounter | default unit | `{concurrent_task_limit}` | The configured limit of concurrent tasks for this worker. Attributes will define orchestration vs activity. |
`durabletask.task.current` | Async UpDownCounter | default unit | `{concurrent_task_current}` | The current concurrent tasks (activity or orchestration) running on the worker. Attributes will define orchestration vs activity. |
`durabletask.task.duration` | Histogram | milliseconds | `ms` | Measures the duration of a task. |
`durabletask.task.count` | Counter | default unit | `{task_count}` | The number of tasks that have been processed. |
`durabletask.errors` | Counter | default unit | `{errors}` | Number of task invocation errors. |

AzureStorage metrics:

Name | Instrument Type | Unit | Unit (UCUM) | Description |
---|---|---|---|---|
`durabletask.azure_storage.partition.delay` | Histogram | milliseconds | `ms` | Measures the delay of the messages as they are dequeued. |
`durabletask.azure_storage.partition.length` | Async UpDownCounter | default unit | `{item_count}` | The count of messages in a partition. |
`durabletask.azure_storage.errors` | Counter | default unit | `{errors}` | Number of task invocation errors. |

note: should we include or exclude the `azure_storage` section?

Core attributes:

Name | Requirement level | Description | Examples |
---|---|---|---|
`durabletask.task.type` | Required | The type of task being run. SHOULD be one of: `activity`, `orchestration` | `activity`, `orchestration` |
`durabletask.task.name` | Required | The name of the task being run. | `MyOrchestration`, `MyActivity` |
`durabletask.task.version` | Conditionally Required | The version of the task being run. Omitted when version is `null`. | `0`, `1`, `v1` |
`durabletask.task.status_code` | Required | The status code of a completed task. This will be the terminal state of the task. | `succeeded`, `failed`, `terminated`, `canceled` |
`durabletask.task.sub_status_code` | Optional | A consumer-supplied string [1]; think of it as an open-ended HTTP status code. | `my_failure_reason`, `other_failure_reason` |

[1]: May need to think about this more. But I see value in having a more granular code for failure reason. It is valuable to differentiate in monitors between expected/transient and unexpected/important failures.

AzureStorage attributes:

Name | Requirement level | Description | Examples |
---|---|---|---|
`durabletask.azure_storage.partition.name` | Required | The name of the partition represented in this metric. | `{hubname}-workitems`, `{hubname}-control-01` |

note: DTFx orchestration service packages SHOULD still include Core attributes when possible.

> One thing customers have asked us for is metrics for Azure Storage, like queue length. It's obviously specific to the DurableTask.AzureStorage backend, but it would be useful if the different backends could add their own metrics as part of this work.

Yeah that is definitely important. But I do wonder if that is something DTFx should implement? Or should Azure Storage be responsible for that? I guess DTFx could add one for now, but have it opt-in only via some startup value.

edit: added `dtfx.partition.length` above.

> Yeah that is definitely important. But I do wonder if that is something DTFx should implement? Or should Azure Storage be responsible for that? I guess DTFx could add one for now, but have it opt-in only via some startup value.

The problem with reporting it from DTFx is that DTFx doesn't have any concept of queues, partitions, or even work-item latency today. If we want DTFx to be able to report this, then we'll probably need to add some optional interface that the backends can implement to surface this information to DTFx.Core.

I see you added `dtfx.partition.name` and `dtfx.partition.length`. It's a little strange since not all backends have the concept of partitions (MSSQL doesn't; less sure about Service Bus). I suppose those kinds of orchestration services could just report having one "default" partition, which is the full backlog size?

@cgillum each DTFx orchestration service library can emit its own metrics. In this case, `DurableTask.AzureStorage` should be emitting those metrics under its own `Meter`. I will update my table to make that more clear.

Edit: I have separated example metrics and attributes between Core and AzureStorage concerns.
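
To make the "own `Meter`" idea concrete, here is a minimal sketch of how a backend could publish the partition length from the table above. The meter name, the `myhub` hub name, and the hard-coded queue lengths are assumptions for illustration only. The table asks for an Async UpDownCounter; as far as I know that instrument is not in the 6.0 package, so this sketch substitutes an observable gauge.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for the queue lengths a real backend would look up.
var queueLengths = new Dictionary<string, long>
{
    ["myhub-workitems"] = 3,
    ["myhub-control-01"] = 7,
};

// The backend owns its own Meter, separate from any Core meter (name is an assumption).
using var meter = new Meter("DurableTask.AzureStorage");

// Observable instruments are pulled: the callback runs only when a listener collects.
meter.CreateObservableGauge<long>(
    "durabletask.azure_storage.partition.length",
    ObservePartitionLengths,
    unit: "{item_count}",
    description: "The count of messages in a partition.");

// In-process consumer, used here only to show the values being collected.
var observed = new Dictionary<string, long>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "DurableTask.AzureStorage")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    foreach (var tag in tags)
    {
        if (tag.Key == "durabletask.azure_storage.partition.name")
        {
            observed[(string)tag.Value!] = value;
        }
    }
});
listener.Start();
listener.RecordObservableInstruments(); // triggers ObservePartitionLengths once

foreach (var pair in observed)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}

// Emits one measurement per partition, tagged with the partition name.
IEnumerable<Measurement<long>> ObservePartitionLengths()
{
    foreach (var pair in queueLengths)
    {
        yield return new Measurement<long>(
            pair.Value,
            new KeyValuePair<string, object?>("durabletask.azure_storage.partition.name", pair.Key));
    }
}
```

A backend without partitions could use the same shape and report a single partition name.
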
With the release of .NET 6 last year, a new metrics API was introduced. It is available in the `System.Diagnostics.DiagnosticSource` 6.0 package, which is backwards compatible with older .NET runtimes (so we do not need to target .NET 6).
https://docs.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation
We should use this API to emit metrics for select DTFx scenarios. Customers can then listen to these metrics themselves and export them out of process appropriately, or use an existing SDK like OpenTelemetry to export them.
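
As a rough sketch of what emitting and listening could look like with this API: the instrument and attribute names below are taken from the tables in this thread, while the meter name `DurableTask.Core` and the recorded values are assumptions. It is not meant as the final design.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter name; this issue does not fix one.
using var meter = new Meter("DurableTask.Core");

Counter<long> taskCount = meter.CreateCounter<long>(
    "durabletask.task.count",
    unit: "{task_count}",
    description: "The number of tasks that have been processed.");
Histogram<double> taskDuration = meter.CreateHistogram<double>(
    "durabletask.task.duration",
    unit: "ms",
    description: "Measures the duration of a task.");

// In-process consumer: subscribe to every instrument on the meter above.
var seen = new List<string>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "DurableTask.Core")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
    seen.Add($"{instrument.Name}={value}"));
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    seen.Add($"{instrument.Name}={value}"));
listener.Start();

// Record one completed activity, tagged with the attributes proposed in this thread.
var type = new KeyValuePair<string, object?>("durabletask.task.type", "activity");
var name = new KeyValuePair<string, object?>("durabletask.task.name", "MyActivity");
taskCount.Add(1, type, name);
taskDuration.Record(12.5, type, name);

foreach (string entry in seen)
{
    Console.WriteLine(entry);
}
```

An exporter SDK such as OpenTelemetry would take the place of the hand-written `MeterListener` here; the emitting side would stay the same.
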
We can start with building a list of metrics we want to collect, their names, value significance, and any dimensions.
Relies on #698