Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

OpenTelemetry: CPU and Memory Usage Tracking #4818

Open sourabh1007 opened 1 day ago

sourabh1007 commented 1 day ago

What do we want to collect? (as implemented in the Java SDK)

| Metric Name | Unit | Metric Type | Description |
| --- | --- | --- | --- |
| cosmos.client.system.avgCpuLoad | Percent | 95th, 99th + histogram | SDK measures avg. system-wide CPU every 10 seconds. This meter captures the 5-second avg. CPU usage measurements. |
| cosmos.client.system.freeMemoryAvailable | MB | None | SDK measures free memory available for the process in MB every 10 seconds. This meter captures the 5-second measurements. |

Available OpenTelemetry-Compatible Packages

OpenTelemetry.Instrumentation.Runtime 1.9.0 (NuGet)
Usage: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/blob/main/examples/runtime-instrumentation/Program.cs
Metrics List: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/blob/main/src/OpenTelemetry.Instrumentation.Runtime/README.md

OpenTelemetry.Instrumentation.Process 0.5.0-beta.6 (NuGet)
Usage: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/blob/main/examples/process-instrumentation/Program.cs
Metrics List: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/blob/main/src/OpenTelemetry.Instrumentation.Process/README.md#step-2-enable-process-instrumentation

.NET extensions metrics (Microsoft Learn)
Metrics List: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/built-in-metrics-diagnostics#microsoftextensionsdiagnosticshealthchecks

Built-in runtime metrics: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/built-in-metrics-runtime

What do we need in the Cosmos DB SDK?

We have observed that brief CPU spikes have negatively impacted the customer experience in the past. Existing libraries let us capture CPU usage only at coarse intervals, such as once a minute (depending on the capabilities of the exporter), so a short spike can go unnoticed. We need more granular data on CPU and memory usage.

Proposal: Enhance the SDK by introducing custom CPU and memory usage metrics. These metrics will collect and record data every 10 seconds, generating a histogram of the values, as outlined above.
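To make the proposal concrete, here is a minimal, language-neutral sketch in Python of what "record a sample every 10 seconds into a histogram" could look like. It is not the SDK's implementation: the class name `CpuHistogram`, the bucket bounds, and the readings are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from bisect import bisect_left

# Hypothetical bucket upper bounds in percent CPU; not taken from any SDK.
CPU_BUCKET_BOUNDS = [10, 25, 50, 75, 90, 100]


class CpuHistogram:
    """Counts samples per bucket and keeps raw values for percentiles."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One bucket per upper bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.samples = []

    def record(self, value):
        # bisect_left makes each bound an inclusive upper edge.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.samples.append(value)

    def percentile(self, p):
        """Nearest-rank percentile over all recorded samples."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


# Made-up CPU readings, one per 10-second sampling tick.
readings = [12.0, 15.0, 11.0, 97.0, 14.0, 13.0]

hist = CpuHistogram(CPU_BUCKET_BOUNDS)
for reading in readings:
    hist.record(reading)

print(hist.counts)
print(hist.percentile(95))
```

With these made-up readings, the single 97% spike lands in its own high bucket and dominates the 95th percentile, which is the kind of signal the proposal is after.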

lmolkova commented 1 day ago

It's an anti-pattern to emit runtime metrics in client-specific instrumentations. .NET 9 will have a bunch of native metrics https://github.com/open-telemetry/semantic-conventions/blob/main/docs/runtime/dotnet-metrics.md that cover these and many other things.

The interval at which metrics are collected is configured by users, not instrumentations - https://github.com/open-telemetry/opentelemetry-dotnet/blob/0343715f49ac8e121ec39acd92f8d5572b3d036d/src/OpenTelemetry/Metrics/Reader/PeriodicExportingMetricReaderOptions.cs#L47.

If Cosmos measures things more frequently than that, the measurements will simply be aggregated across the user-configured interval - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#metricreader-operations
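A rough sketch of that aggregation effect, in Python and under simplifying assumptions: samples are taken every 10 seconds, the user-configured export interval is 60 seconds, and each export reports only the measurements recorded since the previous one. The names `HistogramPoint` and `export_points` are hypothetical and do not correspond to any OpenTelemetry API.

```python
from dataclasses import dataclass


@dataclass
class HistogramPoint:
    """Summary that one export would carry for a histogram instrument."""
    count: int
    total: float
    minimum: float
    maximum: float


def export_points(samples, sample_period_s, export_interval_s):
    """Collapse fine-grained samples into one point per export interval."""
    per_interval = export_interval_s // sample_period_s
    points = []
    for start in range(0, len(samples), per_interval):
        window = samples[start:start + per_interval]
        points.append(
            HistogramPoint(len(window), sum(window), min(window), max(window))
        )
    return points


# Two minutes of made-up CPU readings taken every 10 seconds.
samples = [12.0, 15.0, 11.0, 97.0, 14.0, 13.0,
           10.0, 12.0, 11.0, 13.0, 12.0, 14.0]

points = export_points(samples, sample_period_s=10, export_interval_s=60)
for point in points:
    print(point)
```

Twelve 10-second samples become two exported points. The spike is still visible in the first point's maximum (and would be in its bucket counts), but the individual 10-second timestamps are gone, so the backend sees data at the export interval's resolution no matter how often the instrumentation samples.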