dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Stalled getaddrinfo syscall can lead to massive HttpClient timeouts #81023

Open sksk571 opened 1 year ago

sksk571 commented 1 year ago

Edit by @antonfirsov: This is caused by stalled DNS resolution attempts, see https://github.com/dotnet/runtime/issues/81023#issuecomment-1405441025.

Description

We have a small web application that started sporadically throwing TaskCancelledExceptions after migration from .NET5 to .NET6 and later .NET7. The application consists of just one controller method that retrieves an image from Amazon S3, resizes it, and saves the resized copy back to S3. The application works fine under .NET5, but under .NET6 and .NET7 it sporadically throws thousands of TaskCancelledException errors from the method HttpConnectionPool+HttpConnectionWaiter`1+

Configuration

- Version: .NET6+
- OS version: Alpine Linux 3.16
- Architecture: x64

Regression?

This is a regression in .NET6+ compared to .NET5.

Data

Dump files captured during one of the incidents contain thousands of async calls looking like

```
.00007f0f9ec03878 (0) System.Net.Http.HttpConnectionPool+HttpConnectionWaiter`1+<WaitForConnectionAsync>d__5[[System.__Canon, System.Private.CoreLib]]
..00007f0f9ec03918 (4) System.Net.Http.HttpConnectionPool+<SendWithVersionDetectionAndRetryAsync>d__84
...00007f0f9ec03a38 (0) System.Net.Http.DiagnosticsHandler+<SendAsyncCore>d__10
....00007f0f9ec03b00 (0) System.Net.Http.HttpClient+<<SendAsync>g__Core|83_0>d
.....00007f0f9ec03bc0 (0) Amazon.Runtime.HttpWebRequestMessage+<GetResponseAsync>d__20
```

and

```
.00007f0f9ec1c840 (1) System.Net.Http.HttpConnectionPool+<ConnectToTcpHostAsync>d__99
..00007f0f9ec1c910 (0) System.Net.Http.HttpConnectionPool+<ConnectAsync>d__98
...00007f0f9ec1c9d8 (0) System.Net.Http.HttpConnectionPool+<CreateHttp11ConnectionAsync>d__100
....00007f0f9ec1ca98 (0) System.Net.Http.HttpConnectionPool+<AddHttp11ConnectionAsync>d__74
```

The events from a trace captured using the System.Net.Http provider: [dotnet_20230119_154543_2847c867.9.excel.xlsx](https://github.com/dotnet/runtime/files/10480348/dotnet_20230119_154543_2847c867.9.excel.xlsx)

Analysis

Tracing some of the failed requests to connections using ActivityID shows that the corresponding connection was established and closed, but not used for some reason. This [issue](https://github.com/dotnet/runtime/issues/66297) may be related.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.

Issue Details
Author: sksk571
Assignees: -
Labels: `area-System.Net.Http`, `tenet-performance`
Milestone: -

MihaZupan commented 1 year ago

> after migration from .NET5 to .NET6 and later .NET7

We're aware of this issue on .NET 6, and it can be mitigated by setting the ConnectTimeout (it's infinite by default).

```csharp
new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(5)
}
```
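
For context, a minimal self-contained sketch of how such a handler is typically wired up (the value and variable names are illustrative, not a recommendation):

```csharp
using System;
using System.Net.Http;

// One shared client over a handler with a bounded connect timeout.
var handler = new SocketsHttpHandler
{
    // Bounds the whole connection attempt, including DNS resolution.
    ConnectTimeout = TimeSpan.FromSeconds(5)
};
var httpClient = new HttpClient(handler);
```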

> and later .NET7

Are you seeing similar behavior on .NET 7? 7.0 has a mitigation in place that should prevent this issue from affecting many requests - #71785.

> The events from a trace captured using the System.Net.Http provider

If this situation happens again, could you also collect the following events?

- System.Net.Sockets
- System.Net.Security
- Private.InternalDiagnostics.System.Net.Http

That would give us more information about what the connection pool was doing at the time.
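
For reference, these EventSource providers can typically be captured with dotnet-trace; a rough example (the process id is a placeholder):

```
dotnet-trace collect --process-id <pid> --providers System.Net.Http,System.Net.Sockets,System.Net.Security,Private.InternalDiagnostics.System.Net.Http
```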

antonfirsov commented 1 year ago

Seeing this happening on .NET 7.0 is very weird; I wouldn't be surprised if this is a different issue, especially if it's true that:

> Tracing some of the failed requests to connections using ActivityID shows that the corresponding connection was established and closed, but not used for some reason.

@sksk571 do you see the same scale of TaskCancelledExceptions and closed connections on .NET 7.0 as on .NET 6.0?

I would really like to see the additional traces collected on .NET 7.0, if setting ConnectTimeout (on .NET 6.0!) doesn't help.

sksk571 commented 1 year ago

The issue definitely happens on .NET7, though maybe to a lesser extent than on .NET6. The workaround of setting ConnectTimeout in SocketsHttpHandler doesn't work for us, as the affected code is in a netstandard2.0 library where SocketsHttpHandler is not available.
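
One possible way around the netstandard2.0 limitation (an assumption about how the code could be restructured, not something discussed in this thread) is to have the library accept an HttpClient from the host application, which targets .NET 6+/7 and can therefore configure SocketsHttpHandler itself. The S3ImageService type below is hypothetical:

```csharp
using System;
using System.Net.Http;

// In the .NET 6+/7 host application: configure the handler and hand the client to the library.
var handler = new SocketsHttpHandler { ConnectTimeout = TimeSpan.FromSeconds(5) };
var service = new S3ImageService(new HttpClient(handler));

// In the netstandard2.0 library: depend on an injected HttpClient instead of
// constructing handlers directly (hypothetical type for illustration).
public sealed class S3ImageService
{
    private readonly HttpClient _http;
    public S3ImageService(HttpClient http) => _http = http;
}
```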

I was able to collect the trace with the additional events; how can I share it privately? The trace is from a production system, so I'd rather not attach it to the issue.

MihaZupan commented 1 year ago

Can you send it to Miha.Zupan and Anton.Firszov @microsoft.com?

antonfirsov commented 1 year ago

After exchanging information with @sksk571, it looks like there are sporadic stalled getaddrinfo calls on their server. As a result of #49171, a single stalled call will hang any follow-up DNS resolution requests that go to the same host. This leads to a massive accumulation of pending connection attempts in HttpConnectionPool. These only complete (fail) when the first stalled DNS resolution attempt completes; by that time the originating requests have timed out and end up throwing.
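
To illustrate the shape of the problem (a hypothetical snippet, not code from the affected application): even if a caller puts its own timeout on a lookup, cancellation only abandons the wait; the stalled getaddrinfo call keeps running, and because of #49171 later lookups for the same host queue up behind it.

```csharp
using System;
using System.Net;
using System.Threading;

// Hypothetical bounded DNS lookup on .NET 6+ (the host name and timeout are made up).
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
try
{
    IPAddress[] addresses = await Dns.GetHostAddressesAsync("some-bucket.s3.amazonaws.com", cts.Token);
}
catch (OperationCanceledException)
{
    // The await is abandoned, but the underlying getaddrinfo call is still pending;
    // follow-up resolutions for the same host wait behind it until it completes.
}
```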

We can consider mitigating this by:

Whether we are open to implementing such mitigations depends on whether such DNS misbehavior is common. It would be good to understand what's causing this on your infrastructure, @sksk571. Is the DNS server unresponsive? Is there a timeout configured in resolv.conf? Does it help if you lower it?
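
For illustration only (the values are placeholders, not a recommendation), the resolver timeout is usually tuned via an options line in /etc/resolv.conf:

```
options timeout:2 attempts:2
```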

MihaZupan commented 1 year ago

Looks like a +1 on #19443

antonfirsov commented 1 year ago

Triage: only one customer report so far, and the root cause is environmental. Putting this in "Future" for now; if we get more customers hitting this, we can bump the priority.

matt-augustine commented 1 year ago

We are experiencing similar behavior in a high-traffic ASP.NET Core application on .NET 7.0.101. The application runs on Kubernetes, and the issue typically occurs when new pods are created and begin processing a high volume of requests. The application makes numerous outbound HTTP requests to various cloud services while processing each inbound request.

The symptoms include:

antonfirsov commented 1 year ago

@matt-augustine I'm experimenting with a fix now. It would be great if you could try a custom build against your case for validation. I'll reach out on Teams.

karelz commented 10 months ago

Moving to 9.0 as it has hit 3 customers. They found workarounds, but it would be nice to avoid the problem for others in the future.

DejanPelzel commented 5 months ago

I believe we were severely affected by this issue, which resulted in massive stability problems that were very difficult to debug. We were able to work around it by using a custom DNS resolver and SocketsHttpHandler with ConnectCallback, but it seems this is more widespread than 3 customers and can cause massive damage for things like edge services that communicate over HTTP.
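
For readers hitting the same problem, here is a rough sketch of the kind of workaround described above: doing DNS resolution with an explicit timeout inside ConnectCallback. The timeout values are arbitrary, and the sketch uses the built-in Dns API rather than a full custom resolver, so it is an assumption about the approach, not the exact code used:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;

var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(5),
    ConnectCallback = async (context, cancellationToken) =>
    {
        // Resolve with a shorter, explicit timeout so a stalled lookup fails fast
        // instead of pinning the connection attempt (and the requests waiting on it).
        using var dnsCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        dnsCts.CancelAfter(TimeSpan.FromSeconds(2));

        IPAddress[] addresses = await Dns.GetHostAddressesAsync(
            context.DnsEndPoint.Host, dnsCts.Token).ConfigureAwait(false);

        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            await socket.ConnectAsync(addresses, context.DnsEndPoint.Port, cancellationToken)
                        .ConfigureAwait(false);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};

var client = new HttpClient(handler);
```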