Open 4865783a5d opened 1 year ago
Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.
| Author: | 4865783a5d |
| --- | --- |
| Assignees: | - |
| Labels: | `area-System.Net.Http`, `untriaged` |
| Milestone: | - |
A lot of moving parts here. Are you sure Polly is not retrying the request in 502/503 case (packet capture might help)? Could you run it without Polly? @MihaZupan any insights here?
This issue has been marked `needs-author-action` and may be missing some important information.
@ManickaP we've run a version without Polly for the last few days in an 80/20 traffic-routed scenario and the false negatives have indeed stopped occurring. We're not seeing any retries in Polly, although they might be happening transparently, but I'll investigate on that end and report back here.
This issue has been marked `needs-author-action` and may be missing some important information.
Can you please elaborate on what you mean by "false-negatives" here? It isn't immediately obvious to me from the screenshot of the trace. What is the expected/actual behavior?
Is the issue reproducible on its own? That is, can it be isolated down to a simple project that we can run ourselves to investigate?
@MihaZupan we would not expect a failed `HttpClient` http dependency telemetry together with a successful child span indicating a successful request dependency telemetry:
This generates a false-negative Alert in our Application Insights since we watch failed dependencies.
However, we were able to narrow it down to a Polly retry. The failed http dependency telemetry seems to be emitted even if a retry was successful, but what we're missing here is the telemetry for the successful retry.
See the failed request without Polly:
Similar issues regarding Polly + Dependency call logging:
https://github.com/microsoft/ApplicationInsights-dotnet/issues/1923 https://github.com/microsoft/ApplicationInsights-dotnet/issues/2556
I created a gist (https://gist.github.com/4865783a5d/79374dcbce7f7a3d06c622352a425dfa) which targets the /health endpoint of an Azure App Service. The health check performs a single call to an Azure SQL DB to check if a connection can be made.
On stopping the app service for a brief moment and starting it again, the behavior can be observed:
When distributed tracing support was added to `SocketsHttpHandler`, the assumption made was that if a request already has all the tracing-related headers set, we should avoid instrumenting it to avoid interfering with any existing user logic.
The exception to that was if we were doing internal redirects, where we would clear these headers before issuing the new request.
This works fine as long as the request is not reused by user code. While we generally dissuade users from reusing requests, this does happen in practice (a prime example being Polly as in this case).
The issue that is happening here is:
1. `HttpClient.SendAsync` is called and the request reaches `SocketsHttpHandler`.
2. `SocketsHttpHandler` calls into `DiagnosticsHandler`.
3. `DiagnosticsHandler` adds the tracing headers and communicates with the AppInsights SDK.
4. The request fails and Polly resends the same request instance.
5. `SocketsHttpHandler` calls into `DiagnosticsHandler` again.
6. `DiagnosticsHandler` tries to add the tracing headers, but no-ops as they are already set on the request.

Which results in the trace you are seeing:

Outgoing dependency call => Failure
Downstream server => 200
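To illustrate the last step, here is a simplified, hypothetical sketch of that behavior (this is not the actual `DiagnosticsHandler` source; the handler name, header names and Activity name are assumptions for illustration): the request is only instrumented when it does not already carry tracing headers, so a reused request keeps the IDs from the first attempt.

```csharp
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for the instrumentation behavior described above.
public sealed class SimplifiedTracingHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // A reused request (e.g. one resent by a retry policy) already carries the
        // headers from the first attempt, so this branch no-ops and the second
        // attempt is never reported as a separate dependency call.
        if (request.Headers.Contains("traceparent") || request.Headers.Contains("Request-Id"))
        {
            return await base.SendAsync(request, cancellationToken);
        }

        // First attempt: start an Activity and stamp its ID onto the request so the
        // outgoing call can be correlated by the AppInsights SDK.
        using var activity = new Activity("System.Net.Http.HttpRequestOut").Start();
        request.Headers.TryAddWithoutValidation("traceparent", activity.Id);
        return await base.SendAsync(request, cancellationToken);
    }
}
```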
Your client is making two HTTP requests in this case, but both were instrumented with the same ID, and AppInsights will only remember the result of the first one. The first request failed, so you see the failed dependency call. But the second request used the same ID, so it was correlated to the same operation. The destination server returned a successful response, so you see a 200 there.
The expected behavior here would be for these two requests to be instrumented separately. That is, your trace should look like this instead:
Outgoing dependency call => Failure
Outgoing dependency call => 200
Downstream server => 200
Note that the first request would still show up as a failure as the operation did fail, so even with the correct behavior, your alerts would likely be impacted. Was your expectation that requests which initially failed but succeeded under retries would still appear on your traces, or that they would be silently ignored?
To achieve the above behavior, you can use a workaround like this to manually clear out these headers between retries: https://gist.github.com/MihaZupan/677cb2a1775325fa1aa5c3ac7f263c2b
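For reference, a minimal sketch of what such a workaround can look like (a sketch only, not necessarily identical to the gist; the handler name and header list are assumptions): a `DelegatingHandler` that strips the tracing headers before each attempt so the next attempt is instrumented as a new dependency call.

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Workaround sketch: remove the tracing headers left over from a previous attempt
// so the handler pipeline instruments the next attempt as a separate call.
public sealed class ClearTracingHeadersHandler : DelegatingHandler
{
    private static readonly string[] TracingHeaders =
    {
        "traceparent", "tracestate", "Request-Id", "Correlation-Context"
    };

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        foreach (var header in TracingHeaders)
        {
            request.Headers.Remove(header);
        }

        return base.SendAsync(request, cancellationToken);
    }
}
```

When registering through `IHttpClientFactory` with Microsoft.Extensions.Http.Polly, the handler has to sit inside the retry handler so it runs once per attempt, for example (client name and retry count are placeholders):

```csharp
services.AddHttpClient("downstream")
    .AddTransientHttpErrorPolicy(policy => policy.RetryAsync(3))    // outer: retries
    .AddHttpMessageHandler(() => new ClearTracingHeadersHandler()); // inner: per attempt
```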
@MihaZupan Thanks for the in-depth explanation, that's most helpful. We'll have a look at the workaround and have also adjusted our App Insights alerting to exclude successful retries.
We would have expected Polly retries to be transparent, with a single successful or failed span being emitted. Seeing all retries (and the initial request) would have been fine as well; the current middle ground is not ideal.
Triage: Requests being retried/reused is a reality; we should make sure tracing isn't randomly broken when it happens. Optimistically moving to 8.0.
Unfortunately, we will not be able to fix it in 8.0. Moving to Future for now.
Description
We started seeing random, false-negative dependency traces of http requests where the result status was false and the result code was 503, 502 or "Faulted". As can be seen in the trace, all subsequent child spans are successful, the request is accepted by the server, and a 200 response code (plus payload) is returned.
Reproduction Steps
https://gist.github.com/4865783a5d/79374dcbce7f7a3d06c622352a425dfa
Expected behavior
Correct state of dependency telemetry.
Actual behavior
The dependency telemetry is reported as failed even though the request ultimately succeeds.
Regression?
The behavior started occurring when we updated transitive dependencies of various NuGet packages. Note that somehow, a transitive package downgraded "Microsoft.NETCore.Platforms" to 3.1, where it was previously 5.0.
Both versions use:
Other packages before the behavior occurred:
Other packages when the behavior started to occur:
Known Workarounds
No response
Configuration
Windows Azure App Service
Other information
We are using Polly with the default transient error handling.
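For context, the setup is essentially the standard `HttpClientFactory` + Microsoft.Extensions.Http.Polly registration; a sketch (the client name, retry count and backoff below are placeholders, not the exact production values):

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;

// Sketch of the kind of registration meant by "default transient error handling".
var services = new ServiceCollection();

services.AddHttpClient("downstream")
    // Handles HttpRequestException, 5xx responses and 408 responses.
    .AddTransientHttpErrorPolicy(policy =>
        policy.WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));
```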