dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

How to control HttpConnectionPool throttle? #68423

Closed Nican closed 2 years ago

Nican commented 2 years ago

Description

I am failing to create a lot of concurrent connections. I have this program trying to create hundreds of concurrent connections to Twitter, and it seems to be throttled by some internal throttling mechanism that I am not sure how to control.

The stack trace:

Error inner: System.Threading.Tasks.TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing.
 ---> System.TimeoutException: A task was canceled.
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Tweetinvi.WebLogic.TwitterClientHandler.SendAsync(ITwitterQuery twitterQuery, HttpRequestMessage request, CancellationToken cancellationToken)

Here is the source code for Tweetinvi, and it seems to be making a new HttpClient per request. (Which is inefficient, but does not seem to be the problem)

Taking a look at GetHttp11ConnectionAsync, it seems to be stuck on a _http11RequestQueue. But I can not seem to understand how to control this queue.

Some additional details of why the request has timed out would be nice.

Configuration

.NET 6 on Ubuntu 22.04 x64, compiled to a native binary.

Regression?

Data

Analysis

Since the stack trace is reaching WaitWithCancellationAsync, I would assume that the code is passing through No available HTTP/1.1 connections; request queued. log message on line #556. What queue is this? How do I control this?

I have also added

ServicePointManager.DefaultConnectionLimit = int.MaxValue;
ThreadPool.SetMaxThreads(2048, 2048);
ThreadPool.SetMinThreads(256, 256);

To the first line after Main. And

handler.MaxConnectionsPerServer = int.MaxValue;

to the HttpClient. (After having to recompile Tweetinvi)

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Issue Details
### Description I am failing to create a lot of concurrent connections. I have this program trying to create hundreds of concurrent connections to Twitter, and it seems to be throttled by some internal throttling mechanism that I am not sure how to control. The stack trace: ``` Error inner: System.Threading.Tasks.TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing. ---> System.TimeoutException: A task was canceled. ---> System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken) at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at Tweetinvi.WebLogic.TwitterClientHandler.SendAsync(ITwitterQuery twitterQuery, HttpRequestMessage request, CancellationToken cancellationToken) ``` Here is the source code for [Tweetinvi](https://github.com/linvi/tweetinvi/blob/91cdf83b87f6c0c75d7ece6a13fce792d820e3c8/src/Tweetinvi.WebLogic/HttpClientWebHelper.cs#L20-L60), and it seems to be making a new `HttpClient` per request. (Which is inefficient, but does not seem to be the problem) Taking a look at [GetHttp11ConnectionAsync](https://github.com/dotnet/runtime/blob/a10c1bcfab680991d35d0b414c59e088684b64ba/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs#L506), it seems to be stuck on a `_http11RequestQueue`. But I can not seem to understand how to control this queue. Some additional details of why the request has timed out would be nice. ### Configuration .NET 6 on Ubuntu 22.04 x64, compiled to a native binary. ### Regression? ### Data ### Analysis Since the stack trace is reaching `WaitWithCancellationAsync`, I would assume that the code is passing through `No available HTTP/1.1 connections; request queued.` log message on [line #556](https://github.com/dotnet/runtime/blob/a10c1bcfab680991d35d0b414c59e088684b64ba/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs#L556). What queue is this? How do I control this?
Author: Nican
Assignees: -
Labels: `area-System.Net.Http`, `tenet-performance`, `untriaged`
Milestone: -
Nican commented 2 years ago

Update: This seems to be a problem with Ubuntu 22.04.

Same workload and same runtime under Ubuntu 20.04 looks to be working fine.

MihaZupan commented 2 years ago

I have this program trying to create hundreds of concurrent connections to Twitter

Why?

How are you trying to create the connections?

Nican commented 2 years ago

I am just using the library, and I had to track down how the library was making the connections.

Unfortunately, the library seems to create 1 HttpClient per call: https://github.com/linvi/tweetinvi/blob/91cdf83b87f6c0c75d7ece6a13fce792d820e3c8/src/Tweetinvi.WebLogic/HttpClientWebHelper.cs#L20-L60

iamcarbon commented 2 years ago

Also seeing greater blocking activity / timeouts from HttpClient after migrating to 22.04 on ARM64. Moving back to 20.04 fixed the issues.

davidfowl commented 2 years ago

I am just using the library, and I had to track down how the library was making the connections.

File an issue on the library?

Nican commented 2 years ago

@davidfowl Thank you for your time.

I spent quite some time trying to track down this problem, as I had some free time, and I am glad .NET Core is open source.

What I found infuriating about this issue is that I could not understand what is timing out. I tried a variety of combinations, between different operating systems, and different servers/networks, and the common factors seems to be .NET 6 on Ubuntu 22.04.

Taking a look at the stack trace, and following the source code inside of the .NET library, the code seems to timeout from a mechanism internal to the .NET HttpClient code. I posted a link to where the stack trace end on the original post.

But what I find additionally infuriating is that there seems to be no context about what is timing out. Did a system call timeout? Did the kernel fail to create a handle? Did the network run out of ports? Was any packets sent? Were any packets received back? Did the remote server timeout? (I even at one point tried to packet capture with WireShark, but that did not provide any insights either)

Taking a look at GetHttp11ConnectionAsync (The end of the stack trace), it seems like that .NET has some sort of cache for usable idle connection, and for some reason, the cache seems to just fail to provide new connections.

This seems to be a case where an abstraction was made to provide an optimization, but the abstraction is completely hiding the underlying root cause, and I am in the dark of what is going on. Being open source, I am trying to provide as much detail as I can, but really, I do not have much to go other than A task was canceled..

I do not think see how this is an issue with Tweetinvi, because looking at their source, it is a pretty simple usage of HttpClient, and the library generally works with Ubuntu 20.04. And I do not think I can file a bug report with Ubuntu (or any other open source libraries Ubuntu uses), since the only thing I have to go on is A task was canceled.

To be honest, I am not sure if this is a problem with .NET 6, but .NET 6 is my current blocker because I do not have any additional context of what the issue is.

wfurt commented 2 years ago

Dod you get packet captures @Nican. There are cases when upstream firewall may cut out connection while HttpClient would see it alive. You can try to tune down ideal timers or enable TCP keep-alive. Ubuntu 22 also has OpenSSL3. While that should be supported this is whole new beast and there may be some surprises.

Nican commented 2 years ago

Dod you get packet captures @Nican. There are cases when upstream firewall may cut out connection while HttpClient would see it alive. You can try to tune down ideal timers or enable TCP keep-alive. Ubuntu 22 also has OpenSSL3. While that should be supported this is whole new beast and there may be some surprises.

Thank you for your time. I tried to do some packet capture with tcpdump and wireshark, but I am not very skilled on that front. It was also hard to distinguish what was the problem, since it only happened during high loads. I could provide you with my traces, if you wish.

I do not think it was any firewall issues, since this was just two fresh installs of Ubuntu on VMs on the same hypervisor on a home network.

iamcarbon commented 2 years ago

In our case, we're using a single HttpClient connecting to Amazon S3 with a default 60-second timeout. We're running inside of AWS on ARM64 and use a standard VPC configuration.

We're serving 100-200 requests a second under load (with bursts of up-to 1,000), and were randomly seeing clusters of 100+ requests timeouts in our error logs since updating to Ubuntu 22.04. We reverted back to 20.04 for now after we were unable to isolate a root cause.

It also looks like something deeper in the internals causing the timeout. We also suspect this may be related to OpenSSL 3.0, as we have not seen these errors on another high load internal service that communicates over HTTP.

Still pondering on ideas to isolate further...

Our stack trace:

TaskCanceledException | The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing.

at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
at Amazon.S3.S3Client.SendS3RequestAsync(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationToken cancellationToken)

Our HttpClient configuration:

 private static readonly HttpClient s3HttpClient = new (new SocketsHttpHandler {
        PooledConnectionIdleTimeout = TimeSpan.FromSeconds(15), //  S3 has a idle timeout of 20 seconds after which it closes the connection if the client is not sending data
        PooledConnectionLifetime = TimeSpan.FromMinutes(10),
        ConnectTimeout = TimeSpan.FromSeconds(5),
        MaxConnectionsPerServer = 500,
        UseCookies = false
    }) {
        Timeout = TimeSpan.FromSeconds(60)
    };
Nican commented 2 years ago

@iamcarbon The HttpClient.HandleFailure looks like the re-throw exception. Do you have details on the InnerException?

I wonder if it is also failing on GetHttp11ConnectionAsync (or something similar).

davidfowl commented 2 years ago

This is a beautiful and I think it points to the fact that we need better tools to tell you where time is being spent. We should be able to collect network events and dump something like this:

image

cc @MihaZupan

Nican commented 2 years ago

@davidfowl Thank you for for the nice words. :)

Not to get too off topic, but while we are here and talking about better tracing, may I suggest you look into OpenTelemetry and JaegerTracing.

They are already beautiful open-source momentum in providing better internal analytics of software more than flat logs. For example, CockroachDB is a SQL database that I use, and it allows you to deep dive into the execution of a SQL query (docs). In the same manner, it would be interesting to deep dive into the execution of an HttpClient.

AppInsights (Azure's analytics tool) also have some integration with OpenTelemetry (docs). As well as ElasticSearch (docs) (blog). And Microsoft already has some maintainers on OpenTelemetry .NET.

Once again, thank you for looking into this issue.

davidfowl commented 2 years ago

Thanks I'm extremely familiar with OT, Jaegar, zipkin and other tools that show timeline views for distributed traces. They don't tell you information you asked for though.

MihaZupan commented 2 years ago

Are you able to collect a trace with the following providers enabled?

--providers System.Net.Http,System.Net.NameResolution,System.Net.Sockets,System.Net.Security,Private.InternalDiagnostics.System.Net.Http,Private.InternalDiagnostics.System.Net.Sockets,Private.InternalDiagnostics.System.Net.Security,System.Threading.Tasks.TplEventSource:0x80

This can help us determine what an individual request was doing and spending time on.

MihaZupan commented 2 years ago

Since the stack trace is reaching WaitWithCancellationAsync, I would assume that the code is passing through No available HTTP/1.1 connections; request queued. log message on line #556. What queue is this? How do I control this?

If there are no available connections when you start a request, HttpClient has to go and initiate more. During this time, the request sits in an internal queue and waits for a connection to be available.

This is so that the request can be processed as soon as any connection is available, and it doesn't have to wait for the specific connection it initiated. This is a change that was made in .NET 6.

An unfortunate issue is that if your request timeout kicks in before a connection attempt actually fails, the request won't see the underlying failure. You can try increasing the HttpClient.Timeout and see if requests start failing with connection errors instead of timeouts. The traces mentioned above will also contain this info.

I do not think see how this is an issue with Tweetinvi, because looking at their source, it is a pretty simple usage of HttpClient, and the library generally works with Ubuntu 20.04.

There may very well be an issue in .NET, but it's still worth opening an issue with the library about the fact that HttpClient instances aren't being reused. Even without the Ubuntu 22 issues, it means the library will have sub-optimal performance.

Nican commented 2 years ago

Once again, thank you all for your time.

@MihaZupan I am currently on a trip. I will get you the traces next week.

I already filed a bug report with Tweetinvi (https://github.com/linvi/tweetinvi/issues/1160) last week.

ghost commented 2 years ago

This issue has been marked needs-author-action and may be missing some important information.

Nican commented 2 years ago

Apologies for the delay; I only got home from my trip last night.

I attempted to repro this morning. A clean install of Ubuntu 22.04, with .NET 6, on the same hypervisor. To my satisfaction (and dismay) I can not repro the issue.

Thank you so much for the help so far. Open source is awesome, and I hope more meaningful error messages are added to track down this particular edge case in the future.

@iamcarbon Please confirm that you are also not having the same issue.