dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.86k stars 4.62k forks source link

HttpClient spans don't show up as exemplars #103893

Closed JamesNK closed 1 month ago

JamesNK commented 2 months ago

Description

I'm adding a new exemplar feature to the Aspire dashboard. Basically, when a metric is recorded, in-progress spans are included as example data.

However, I never see the HttpClient span as an exemplar. Someone should investigate why and fix this.

I don't know exactly how this feature technically works. However, my theory is the HttpClient span is ended before http.client.request.duration recorded, so the HttpClient activity isn't available for opentelemetry-sdk to associate with the metric. The order may need to be flipped: record http.client.request.duration and then stop the activity. I think it is a matter of making the metrics handler a child of the diagnostics handler here.

Reproduction Steps

Capture OTEL data with metrics + exemplars enabled + make HTTP requests with HttpClient.

See https://github.com/dotnet/aspire/pull/4629 for an example.

Expected behavior

HttpClient spans show update as exemplars.

Actual behavior

HttpClient spans are never exemplars.

Regression?

No

Known Workarounds

No response

Configuration

.NET 8

Other information

No response

JamesNK commented 2 months ago

cc @reyang @noahfalk @tarekgh @samsp-msft

JamesNK commented 2 months ago

Update: I do see HttpClient spans show up as exemplars for some metrics: dns.lookup.duration, and http.client.request.time_in_queue.

For example, the points here are correctly HttpClient spans: image

But HttpClient spans don't show up in http.client.request.duration (IMO, the most important one for them to be visible in). Only ASP.NET Core and gRPC spans are exemplars are visible.

It seems like it is important to always end a duration metric before its associated activity so exemplars work. Is that correct @reyang?

samsp-msft commented 2 months ago
cijothomas commented 2 months ago

It seems like it is important to always end a duration metric before its associated activity so exemplars work. Is that correct

Yes. OTel's instrumentation library is reporting metric while Activity is still active, though that has some hacky behavior as it relies on activity duration, which is not normally available unless activity itself ends!

CodeBlanch commented 2 months ago

@JamesNK

If you are doing...

builder.SetExemplarFilter(ExemplarFilterType.TraceBased)

Then yes there will need to be an active trace (Activity) alive when the measurement is recorded to get an exemplar spit out.

There are other filters (and an envvar for histograms) you can use which will generate exemplars for all measurements but if there is no Activity they won't be linked to the trace you'll just get raw measurement + any tags that were filtered out.

PS: ❤️ the visualization in the dashboard.

cijothomas commented 2 months ago

I think the case here is - the HttpClient activity is ended, but the Asp.Net Core one is still active, so Exempalr will still link to the right traceid

antonfirsov commented 1 month ago

The order may need to be flipped: record http.client.request.duration and then stop the activity. I think it is a matter of making the metrics handler a child of the diagnostics handler

@JamesNK would it make sense if I would send you a private build of System.Net.Http.dll with this change so you can test if it helps before I move on opening a PR? If yes what version/branch should I base the change on?

JamesNK commented 1 month ago

Ok. Update to date main branch is fine.

This is a recent daily .NET 9 SDK I installed - 9.0.100-preview.7.24329.1. I can update to the latest daily .NET 9 SDK if required.

JamesNK commented 1 month ago

To write a unit test for this I think you would assert in a MeterListener event that the http.client.request.duration value is received while the HTTP activity is present.

Update: See https://github.com/dotnet/aspnetcore/pull/56592 as an example

antonfirsov commented 1 month ago

@JamesNK built from the current main: System.Net.Http-SwappedTelemetry.zip, PTAL if it does the job!

Note that the naive change breaks the ability of registering metrics enrichment callbacks on System.Net.Http.HttpRequestOut.Stop DaignosticSource events, unless we are willing to commit to a refactor that merges the two handlers so we can reorder things in complicated ways. I would prefer taking the breaking change instead, since I doubt there are many users doing enrichment together with DiagnosticListener. In fact, metrics enrichment seems to be a vastly unused feature. Thoughts?

It may also break other stuff, didn't have a chance yet to investigate the impact fully.

JamesNK commented 1 month ago

I don't think we should worry about it.

JamesNK commented 1 month ago

I couldn't test this with Aspire for reasons, so I wrote a test app that exports to console: https://github.com/JamesNK/MetricsExemplarsTest

9.0 current (note: no exemplar):

image

9.0 with custom System.Net.Http (note: has exemplar):

image

It works.