dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

HttpClient false-negative dependency traces #86094

Open 4865783a5d opened 1 year ago

4865783a5d commented 1 year ago

Description

We started seeing random, false-negative dependency traces for HTTP requests where the result status was false and the result code was 503, 502, or "Faulted". As can be seen in the trace, all subsequent child spans are successful: the request is accepted by the server and a 200 response code (plus payload) is returned.

[screenshot: false-negative dependency traces]

Reproduction Steps

https://gist.github.com/4865783a5d/79374dcbce7f7a3d06c622352a425dfa

Expected behavior

Correct state of dependency telemetry.

Actual behavior

The dependency telemetry is marked as failed.

Regression?

The behavior started occurring when we updated transitive dependencies of various NuGet packages. Note that, somehow, a transitive package downgraded "Microsoft.NETCore.Platforms" from 5.0.0 to 3.1.0.

Both versions use:

Microsoft.ApplicationInsights 2.21.0
Microsoft.ApplicationInsights.DependencyCollector 2.21.0
Microsoft.ApplicationInsights.EventCounterCollector 2.21.0
Microsoft.ApplicationInsights.PerfCounterCollector 2.21.0
Microsoft.ApplicationInsights.WindowsServer 2.21.0
Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel 2.21.0
Microsoft.AspNetCore.Authentication 2.2.0
Microsoft.AspNetCore.Authentication.Abstractions 2.2.0
Microsoft.AspNetCore.Authentication.Core 2.2.0
Microsoft.AspNetCore.Authentication.JwtBearer 6.0.7
Microsoft.AspNetCore.Cryptography.Internal 2.2.0
Microsoft.AspNetCore.DataProtection 2.2.0
Microsoft.AspNetCore.DataProtection.Abstractions 2.2.0
Microsoft.AspNetCore.Hosting 2.2.7
Microsoft.AspNetCore.Hosting.Abstractions 2.2.0
Microsoft.AspNetCore.Hosting.Server.Abstractions 2.2.0
Microsoft.AspNetCore.Http 2.2.0
Microsoft.AspNetCore.Http.Abstractions 2.2.0
Microsoft.AspNetCore.Http.Extensions 2.2.0
Microsoft.AspNetCore.Http.Features 5.0.17
Microsoft.AspNetCore.WebUtilities 2.2.0
Microsoft.Azure.AppConfiguration.AspNetCore 5.0.0
Microsoft.Bcl.AsyncInterfaces 1.1.1
Microsoft.CSharp 4.5.0
Microsoft.ApplicationInsights.AspNetCore 2.21.0

Other Packages before the behavior occurred:

Azure.Core 1.24.0
Microsoft.Extensions.Azure 1.2.0
Microsoft.NETCore.Platforms 5.0.0
Microsoft.Win32.Registry 5.0.0
Microsoft.Win32.SystemEvents 5.0.0
System.Diagnostics.PerformanceCounter 5.0.0

Other Packages when behavior started to occur:

Azure.Core 1.30.0
Microsoft.Extensions.Azure 1.6.3
Microsoft.NETCore.Platforms 3.1.0
Microsoft.Win32.Registry 4.7.0
Microsoft.Win32.SystemEvents 4.7.0
System.Diagnostics.PerformanceCounter 4.7.0

Known Workarounds

No response

Configuration

net6.0

Windows Azure App Service

Other information

We are using Polly with the default transient error handling.
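For context, the shape of that policy (as quoted in the issue details further down) is a registry-based retry with exponential backoff; `HandleTransientHttpError()` from Polly.Extensions.Http covers `HttpRequestException`, 5xx responses, and 408. A minimal sketch of the wiring, with the client name and the literal registry key standing in for the project's own constants:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

var services = new ServiceCollection();

// Policy registry holding the "default transient error handling" policy:
// retry on HttpRequestException, 5xx, and 408, with exponential backoff.
var registry = services.AddPolicyRegistry();
registry.Add("DefaultRetryPolicy", HttpPolicyExtensions
    .HandleTransientHttpError()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));

// Clients pick the policy up from the registry by name.
services.AddHttpClient("api")
    .SetHandlerLifetime(TimeSpan.FromMinutes(5))
    .AddPolicyHandlerFromRegistry("DefaultRetryPolicy");
```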

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Issue Details
Reproduction Steps

Since the behavior is random and only present in 2 of 4 environments (which have more load than the others), it's currently not reproducible.

Service registration:

```csharp
services.AddHttpClient()
    .SetHandlerLifetime(TimeSpan.FromMinutes(5))
    .AddPolicyHandlerFromRegistry(PolicyConstants.DefaultRetryPolicy);
```

HttpClient call:

```csharp
using var requestMessage = new HttpRequestMessage(HttpMethod.Get, url)
{
    Headers = { Authorization = new("Bearer", tokenResponse.AccessToken) }
};
await HttpClient.SendAsync(requestMessage, cancellationToken);
```

Other information

The only significant change between the two versions was adding Polly to the HTTP clients:

```csharp
builder.Registry.Add(PolicyConstants.DefaultRetryPolicy, HttpPolicyExtensions
    .HandleTransientHttpError()
    .WaitAndRetryAsync(3, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))));
return builder;
```
Author: 4865783a5d
Assignees: -
Labels: `area-System.Net.Http`, `untriaged`
Milestone: -
ManickaP commented 1 year ago

A lot of moving parts here. Are you sure Polly is not retrying the request in the 502/503 case (a packet capture might help)? Could you run it without Polly? @MihaZupan, any insights here?

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

4865783a5d commented 1 year ago

@ManickaP we've run a version without Polly for the last few days in an 80/20 traffic-routed scenario, and the false negatives have indeed stopped occurring. We're not seeing any retries in Polly, although they might be transparent. I'll investigate on that end and report back here.

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

MihaZupan commented 1 year ago

Can you please elaborate on what you mean by "false-negatives" here? It isn't immediately obvious to me from the screenshot of the trace. What is the expected/actual behavior?

Is the issue reproducible on its own? That is, can it be isolated down to a simple project that we can run ourselves to investigate?

4865783a5d commented 1 year ago

@MihaZupan we would not expect a failed HttpClient HTTP dependency telemetry item together with a successful child span indicating a successful request: [screenshot]

This generates a false-negative alert in our Application Insights, since we alert on failed dependencies.

However, we were able to narrow it down to a Polly retry. The failed HTTP dependency telemetry seems to be emitted even if a retry was successful; what is missing here is the telemetry for the successful retry.

See the failed request without Polly: [screenshot]

Similar issues regarding Polly + dependency call logging:

https://github.com/microsoft/ApplicationInsights-dotnet/issues/1923
https://github.com/microsoft/ApplicationInsights-dotnet/issues/2556

I created a gist (https://gist.github.com/4865783a5d/79374dcbce7f7a3d06c622352a425dfa) which targets the /health endpoint of an Azure App Service. The health check performs a single call to an Azure SQL DB to check whether a connection can be made.

By stopping the app service for a brief moment and starting it again, the behavior can be observed:

[screenshot]
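For readers who want to try this themselves, a self-contained approximation of such a repro (not the linked gist; the base URL, client name, and retry settings here are placeholders) could look roughly like this:

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

var services = new ServiceCollection();

// Hypothetical named client pointing at the app service's /health endpoint.
services.AddHttpClient("health", c => c.BaseAddress = new Uri("https://contoso.azurewebsites.net"))
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));

var client = services.BuildServiceProvider()
    .GetRequiredService<IHttpClientFactory>()
    .CreateClient("health");

// While the app service restarts, the first attempt gets a 502/503 and a retry
// succeeds; in the issue's setup, the failed attempt is what surfaces as the
// (apparently failed) dependency telemetry.
var response = await client.GetAsync("/health");
Console.WriteLine(response.StatusCode);
```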

MihaZupan commented 1 year ago

When distributed tracing support was added to SocketsHttpHandler, the assumption was that if a request already has all the tracing-related headers set, we should avoid instrumenting it so as not to interfere with any existing user logic. The exception to that was internal redirects, where we would clear these headers before issuing the new request.

This works fine as long as the request is not reused by user code. While we generally dissuade users from reusing requests, this does happen in practice (a prime example being Polly as in this case).
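In other words, whether an attempt gets freshly instrumented hinges on whether the tracing headers are already present on the request, and a reused request still carries the headers stamped on the first attempt. A simplified illustration of that described behavior (not the actual runtime source; header names per the W3C and hierarchical formats):

```csharp
using System;
using System.Net.Http;

// A reused HttpRequestMessage keeps the headers written by the first attempt,
// so this check makes any retry with the same instance look "already instrumented".
static bool AlreadyInstrumented(HttpRequestMessage request) =>
    request.Headers.Contains("traceparent") ||   // W3C trace context format
    request.Headers.Contains("Request-Id");      // legacy hierarchical format

var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/");
Console.WriteLine(AlreadyInstrumented(request)); // False: the first send gets instrumented

// Simulate what the first send leaves behind on the request object.
request.Headers.Add("traceparent", "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
Console.WriteLine(AlreadyInstrumented(request)); // True: a retry reusing this request is skipped
```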

The issue that is happening here is:

Outgoing dependency call => Failure
    Downstream server => 200

Your client is making two HTTP requests in this case, but both were instrumented with the same ID, and AppInsights will only remember the result of the first one. The first request failed, so you see the failed dependency call. But the second request used the same ID, so it was correlated to the same operation. The destination server returned a successful response, so you see a 200 there.

What would be the expected behavior here is for these two requests to be instrumented separately. That is, your trace should look like this instead:

Outgoing dependency call => Failure

Outgoing dependency call => 200
    Downstream server => 200

Note that the first request would still show up as a failure as the operation did fail, so even with the correct behavior, your alerts would likely be impacted. Was your expectation that requests which initially failed but succeeded under retries would still appear on your traces, or that they would be silently ignored?

To achieve the above behavior, you can use a workaround like this to manually clear out these headers between retries: https://gist.github.com/MihaZupan/677cb2a1775325fa1aa5c3ac7f263c2b
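The gist itself isn't reproduced here, but the general shape of such a workaround is a DelegatingHandler that sits below the retry handler and strips the correlation headers before each attempt, so every retry is instrumented as its own dependency call. A minimal sketch (the class name and registration are made up, and the exact header list should be checked against the gist):

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Clears the correlation headers stamped by a previous attempt so that a
// reused request is instrumented again on the next send.
public sealed class ClearTracingHeadersHandler : DelegatingHandler
{
    private static readonly string[] TracingHeaders =
    {
        "traceparent", "tracestate",          // W3C trace context
        "Request-Id", "Correlation-Context",  // hierarchical format / baggage
    };

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        foreach (var header in TracingHeaders)
        {
            request.Headers.Remove(header);
        }

        return base.SendAsync(request, cancellationToken);
    }
}

// Registration sketch: add this handler *after* the policy handler so it runs
// inside the retry loop, i.e. once per individual attempt:
//
// services.AddTransient<ClearTracingHeadersHandler>();
// services.AddHttpClient("api")
//     .AddPolicyHandlerFromRegistry("DefaultRetryPolicy")
//     .AddHttpMessageHandler<ClearTracingHeadersHandler>();
```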

4865783a5d commented 1 year ago

@MihaZupan Thanks for the in-depth explanation - that's most helpful. We'll have a look at the workaround and have also adjusted our App Insights alerting to exclude successful retries.

We would have expected Polly retries to be transparent, with a single successful or failed span emitted. Seeing all retries (and the initial request) would have been fine as well. The current middle ground is not ideal.

MihaZupan commented 1 year ago

Triage: requests being retried/reused is a reality; we should make sure tracing isn't randomly broken when it happens. Optimistically moving to 8.0.

karelz commented 1 year ago

Unfortunately, we will not be able to fix it in 8.0. Moving to Future for now.