djluck / prometheus-net.DotNetRuntime

Exposes .NET core runtime metrics (GC, JIT, lock contention, thread pool) using the prometheus-net package
MIT License

prometheus-net.DotNetMetrics

A plugin for the prometheus-net package that exposes .NET Core runtime metrics, including GC, JIT, lock contention and thread pool statistics.

These metrics are essential for understanding the performance of any non-trivial application. Even if your application is well instrumented, you're only getting half the story: what the runtime is doing completes the picture.

Using this package

Requirements

Install it

The package can be installed from NuGet:

dotnet add package prometheus-net.DotNetRuntime

Start collecting metrics

You can start metric collection with:

IDisposable collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();

You can customize the types of .NET metrics collected via the Customize method:

IDisposable collector = DotNetRuntimeStatsBuilder
    .Customize()
    .WithContentionStats()
    .WithJitStats()
    .WithThreadPoolStats()
    .WithGcStats()
    .WithExceptionStats()
    .StartCollecting();

Once the collector is started, metrics prefixed with dotnet_ should appear in your metric output (make sure you are exporting your metrics).
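The note about exporting metrics can be sketched as follows, assuming a console application that uses prometheus-net's standalone MetricServer. The port 9095 and the RuntimeMetricsHost class name are arbitrary choices for illustration, not something this package prescribes. An ASP.NET Core application would more likely use prometheus-net's own middleware instead.

```csharp
using System;
using Prometheus;               // prometheus-net: MetricServer
using Prometheus.DotNetRuntime; // this package: DotNetRuntimeStatsBuilder

// Hypothetical helper class, named here only for illustration.
public static class RuntimeMetricsHost
{
    // Starts an HTTP exporter on the given port and begins runtime metric collection.
    // The caller is responsible for disposing both handles on shutdown.
    public static (IDisposable Server, IDisposable Collector) Start(int port)
    {
        var server = new MetricServer(port: port);
        server.Start();

        IDisposable collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();
        return (server, collector);
    }

    public static void Main()
    {
        // Assumption: port 9095 is free on this machine.
        var (server, collector) = Start(port: 9095);

        Console.WriteLine("Serving metrics on http://localhost:9095/metrics, press Enter to stop.");
        Console.ReadLine();

        // Stop collecting runtime events first, then shut down the exporter.
        collector.Dispose();
        server.Dispose();
    }
}
```

Disposing the collector stops metric collection, so keep the returned IDisposable alive for as long as the application should report runtime metrics.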

Choosing a CaptureLevel

By default, the library generates metrics based on event counters. This allows for basic instrumentation of applications with very little performance overhead.

You can enable higher-fidelity metrics by providing a custom CaptureLevel, for example:

DotNetRuntimeStatsBuilder
    .Customize()
    .WithGcStats(CaptureLevel.Informational)
    .WithExceptionStats(CaptureLevel.Errors)
    ...

Most builder methods accept a custom CaptureLevel. See the documentation on exposed metrics for more information.

Performance impact of CaptureLevel.Errors+

The harder you work the .NET Core runtime, the more events it generates. Event generation and processing costs can stack up, especially around these types of events:

Recycling collectors

There have been long-running performance issues since .NET Core 3.1 that can cause CPU consumption to grow over time when long-running trace sessions are used. Many of these issues have been addressed in .NET 6.0, but a workaround was also identified: stopping and starting (i.e. recycling) collectors periodically helps reduce CPU consumption:

IDisposable collector = DotNetRuntimeStatsBuilder.Default()
    // Recycles all collectors once every day
    .RecycleCollectorsEvery(TimeSpan.FromDays(1))
    .StartCollecting();

While recycling has been observed to reduce CPU consumption, it has also been identified as a possible cause of application instability.

Behaviour on different runtime versions is:

TL;DR: If you observe increasing CPU over time, try enabling recycling. If you see unexpected crashes after adopting this package, try disabling recycling.
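Because the advice above may mean switching recycling on or off after deployment, one option is to make it a runtime toggle. The following is a minimal sketch, not part of this package: the CollectorFactory class and the DOTNET_METRICS_RECYCLE environment variable are hypothetical names chosen for illustration, and the one-day interval simply mirrors the example above.

```csharp
using System;
using Prometheus.DotNetRuntime;

// Hypothetical helper, not provided by prometheus-net.DotNetRuntime.
public static class CollectorFactory
{
    // Hypothetical environment variable name, used only in this sketch.
    public const string RecycleEnvVar = "DOTNET_METRICS_RECYCLE";

    // Returns true only for "1" or "true" (case-insensitive, surrounding whitespace ignored).
    public static bool ShouldRecycle(string value)
    {
        if (value == null) return false;
        string trimmed = value.Trim();
        return trimmed == "1" || trimmed.Equals("true", StringComparison.OrdinalIgnoreCase);
    }

    // Starts the default collectors, with daily recycling only when the toggle is set.
    public static IDisposable Start()
    {
        bool recycle = ShouldRecycle(Environment.GetEnvironmentVariable(RecycleEnvVar));

        return recycle
            ? DotNetRuntimeStatsBuilder.Default()
                .RecycleCollectorsEvery(TimeSpan.FromDays(1))
                .StartCollecting()
            : DotNetRuntimeStatsBuilder.Default().StartCollecting();
    }
}
```

With this in place, calling CollectorFactory.Start() at application startup leaves recycling off unless the variable is set, so it can be enabled when CPU growth is observed and removed again if instability appears.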

Examples

An example docker-compose stack is available in the examples/ folder. Start it with:

docker-compose up -d

You can then visit http://localhost:3000 to view metrics being generated by a sample application.

Grafana dashboard

The exposed metrics can drive a rich dashboard, giving you graphical insight into the performance of your application (an exported dashboard is available here):

Grafana dashboard sample

Further reading