Azure / durabletask

Durable Task Framework allows users to write long-running, persistent workflows in C# using async/await capabilities.
Apache License 2.0

Support metrics using System.Diagnostics.Metrics #785

Open jviau opened 2 years ago

jviau commented 2 years ago

With the release of .NET 6 last year, a new metrics API was introduced. This is available in the System.Diagnostics.DiagnosticSource 6.0 package, which is backward compatible with older .NET runtimes (so we do not need to target .NET 6).

https://docs.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation

We should use this API to emit metrics for select DTFx scenarios. Customers can then listen to these metrics themselves and export them out of process appropriately, or use an existing SDK like OpenTelemetry to export them.
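For illustration, a minimal sketch of emitting a metric through this API. The Meter name `"DurableTask.Core"` and version are assumptions, not the final design:

```csharp
using System.Diagnostics.Metrics;

// Hypothetical sketch: Meter name/version are illustrative assumptions.
var meter = new Meter("DurableTask.Core", "1.0.0");

var taskCount = meter.CreateCounter<long>(
    "durabletask.task.count",
    unit: "{task_count}",
    description: "The number of tasks that have been processed.");

taskCount.Add(1); // one increment per processed task
```

Consumers could then observe this in-process with a `MeterListener`, or export it with the OpenTelemetry SDK via `AddMeter("DurableTask.Core")`.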

We can start with building a list of metrics we want to collect, their names, value significance, and any dimensions.

Relies on #698

cgillum commented 2 years ago

One thing customers have asked for is metrics for Azure Storage, like queue length. It's obviously specific to the DurableTask.AzureStorage backend, but it would be useful if the different backends could add their own metrics as part of this work.

jviau commented 2 years ago

Metrics

Core

| Name | Instrument Type | Unit | Unit (UCUM) | Description |
|---|---|---|---|---|
| `durabletask.task.limit` | Async UpDownCounter | default unit | `{concurrent_task_limit}` | The configured limit of concurrent tasks for this worker. Attributes will define orchestration vs. activity. |
| `durabletask.task.current` | Async UpDownCounter | default unit | `{concurrent_task_current}` | The current number of concurrent tasks (activity or orchestration) running on the worker. Attributes will define orchestration vs. activity. |
| `durabletask.task.duration` | Histogram | milliseconds | `ms` | Measures the duration of a task. |
| `durabletask.task.count` | Counter | default unit | `{task_count}` | The number of tasks that have been processed. |
| `durabletask.errors` | Counter | default unit | `{errors}` | Number of task invocation errors. |
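A sketch of how the Core instruments above could be declared (Meter name and the backing values are illustrative assumptions). Note that the `*UpDownCounter` instruments shipped in System.Diagnostics.DiagnosticSource 7.0 (.NET 7); on the 6.0 package mentioned above, `CreateObservableGauge` would be the closest stand-in:

```csharp
using System.Diagnostics.Metrics;

// Illustrative sketch only; Meter name and values are assumptions.
var meter = new Meter("DurableTask.Core");

// Async (observable) instruments pull their value via callback at collection time.
int configuredLimit = 100; // would come from worker configuration
int currentTasks = 0;      // would be tracked by the worker

meter.CreateObservableUpDownCounter("durabletask.task.limit",
    () => configuredLimit, unit: "{concurrent_task_limit}");
meter.CreateObservableUpDownCounter("durabletask.task.current",
    () => currentTasks, unit: "{concurrent_task_current}");

// Synchronous instruments are recorded at the call site.
var taskDuration = meter.CreateHistogram<double>("durabletask.task.duration", unit: "ms");
var taskCount = meter.CreateCounter<long>("durabletask.task.count", unit: "{task_count}");
var errors = meter.CreateCounter<long>("durabletask.errors", unit: "{errors}");
```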

Azure Storage

| Name | Instrument Type | Unit | Unit (UCUM) | Description |
|---|---|---|---|---|
| `durabletask.azure_storage.partition.delay` | Histogram | milliseconds | `ms` | Measures the delay of messages as they are dequeued. |
| `durabletask.azure_storage.partition.length` | Async UpDownCounter | default unit | `{item_count}` | The count of messages in a partition. |
| `durabletask.azure_storage.errors` | Counter | default unit | `{errors}` | Number of task invocation errors. |

note: should we include or exclude azure_storage section?
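A sketch of how the backend could report partition length with a per-partition attribute, using the multi-measurement observable overload (Meter name, queue names, and lengths are placeholders; requires DiagnosticSource 7.0+ for `ObservableUpDownCounter`):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch: DurableTask.AzureStorage emitting partition length
// under its own Meter. Partition names and lengths are placeholders here;
// a real implementation would query the storage queues.
var meter = new Meter("DurableTask.AzureStorage");

var partitionLength = meter.CreateObservableUpDownCounter(
    "durabletask.azure_storage.partition.length",
    () => new[]
    {
        new Measurement<long>(12, new KeyValuePair<string, object?>(
            "durabletask.azure_storage.partition.name", "myhub-workitems")),
        new Measurement<long>(3, new KeyValuePair<string, object?>(
            "durabletask.azure_storage.partition.name", "myhub-control-01")),
    },
    unit: "{item_count}");
```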

Attributes

Core

| Name | Requirement level | Description | Examples |
|---|---|---|---|
| `durabletask.task.type` | Required | The type of task being run. SHOULD be one of: `activity`, `orchestration`. | `activity`, `orchestration` |
| `durabletask.task.name` | Required | The name of the task being run. | `MyOrchestration`, `MyActivity` |
| `durabletask.task.version` | Conditionally Required | The version of the task being run. Omitted when the version is null. | `0`, `1`, `v1` |
| `durabletask.task.status_code` | Required | The status code of a completed task. This will be the terminal state of the task. | `succeeded`, `failed`, `terminated`, `canceled` |
| `durabletask.task.sub_status_code` | Optional | A consumer-supplied string [1]; think of it as an open-ended HTTP status code. | `my_failure_reason`, `other_failure_reason` |

[1]: May need to think about this more, but I see value in having a more granular code for failure reasons. It is valuable for monitors to be able to differentiate between expected/transient and unexpected/important failures.
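Putting the attributes and instruments together, a task completion could be recorded roughly like this (Meter wiring and attribute values are illustrative assumptions):

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Sketch of recording a task duration with the Core attributes above.
var meter = new Meter("DurableTask.Core");
var taskDuration = meter.CreateHistogram<double>("durabletask.task.duration", unit: "ms");

// TagList avoids allocations for small attribute sets (DiagnosticSource 6.0+).
var tags = new TagList
{
    { "durabletask.task.type", "activity" },
    { "durabletask.task.name", "MyActivity" },
    { "durabletask.task.status_code", "succeeded" },
};
taskDuration.Record(123.4, tags); // elapsed milliseconds for the task
```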

Azure Storage

Name Requirement level Description Examples
durabletask.azure_storage.partition.name Required The name of the partition represented in this metric. {hubname}-workitems, {hubname}-control-01

note: DTFx orchestration service packages SHOULD still include Core attributes when possible.

jviau commented 2 years ago

> One thing customers have asked for is metrics for Azure Storage, like queue length. It's obviously specific to the DurableTask.AzureStorage backend, but it would be useful if the different backends could add their own metrics as part of this work.

Yeah that is definitely important. But I do wonder if that is something DTFx should implement? Or should Azure Storage be responsible for that? I guess DTFx could add one for now, but have it opt-in only via some startup value.

edit: added dtfx.partition.length above.

cgillum commented 2 years ago

> Yeah that is definitely important. But I do wonder if that is something DTFx should implement? Or should Azure Storage be responsible for that? I guess DTFx could add one for now, but have it opt-in only via some startup value.

The problem with reporting it from DTFx is that DTFx doesn't have any concept of queues, partitions, or even work-item latency today. If we want DTFx to be able to report this, then we'll probably need to add some optional interface that the backends can implement to surface this information to DTFx.Core.
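One hypothetical shape for such an interface (not an existing DTFx API; every name here is an assumption for discussion):

```csharp
// Hypothetical, for discussion only: an optional interface a backend could
// implement so DurableTask.Core can surface backlog metrics it otherwise
// has no concept of (queues, partitions, work-item latency).
public interface IPartitionMetricsProvider
{
    // One entry per partition; single-partition backends (e.g. MSSQL)
    // could report a single "default" partition with the full backlog size.
    IReadOnlyList<PartitionMetric> GetPartitionMetrics();
}

public readonly record struct PartitionMetric(
    string Name,               // e.g. "myhub-workitems"
    long Length,               // messages currently in the partition
    double OldestMessageAgeMs  // dequeue delay signal
);
```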

I see you added dtfx.partition.name and dtfx.partition.length. It's a little strange since not all backends have the concept of partitions (MSSQL doesn't - less sure about Service Bus). I suppose for those kinds of orchestration services, they could just report having one "default" partition, which is the full backlog size?

jviau commented 2 years ago

@cgillum each DTFx orchestration service library can emit its own metrics. In this case, DurableTask.AzureStorage should be emitting those metrics under its own Meter. I will update my table to make that more clear.

Edit: I have separated example metrics and attributes between Core and AzureStorage concerns.