Closed Nican closed 2 years ago
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | Nican |
---|---|
Assignees: | - |
Labels: | `area-System.Net.Http`, `tenet-performance`, `untriaged` |
Milestone: | - |
Update: This seems to be a problem with Ubuntu 22.04.
Same workload and same runtime under Ubuntu 20.04 looks to be working fine.
I have this program trying to create hundreds of concurrent connections to Twitter
Why?
How are you trying to create the connections?
I am just using the library, and I had to track down how the library was making the connections.
Unfortunately, the library seems to create 1 HttpClient per call: https://github.com/linvi/tweetinvi/blob/91cdf83b87f6c0c75d7ece6a13fce792d820e3c8/src/Tweetinvi.WebLogic/HttpClientWebHelper.cs#L20-L60
Also seeing greater blocking activity / timeouts from HttpClient after migrating to 22.04 on ARM64. Moving back to 20.04 fixed the issues.
I am just using the library, and I had to track down how the library was making the connections.
File an issue on the library?
@davidfowl Thank you for your time.
I spent quite some time trying to track down this problem, as I had some free time, and I am glad .NET Core is open source.
What I found infuriating about this issue is that I could not understand what is timing out. I tried a variety of combinations, between different operating systems, and different servers/networks, and the common factors seems to be .NET 6 on Ubuntu 22.04.
Taking a look at the stack trace, and following the source code inside of the .NET library, the code seems to timeout from a mechanism internal to the .NET HttpClient code. I posted a link to where the stack trace end on the original post.
But what I find additionally infuriating is that there seems to be no context about what is timing out. Did a system call timeout? Did the kernel fail to create a handle? Did the network run out of ports? Was any packets sent? Were any packets received back? Did the remote server timeout? (I even at one point tried to packet capture with WireShark, but that did not provide any insights either)
Taking a look at GetHttp11ConnectionAsync
(The end of the stack trace), it seems like that .NET has some sort of cache for usable idle connection
, and for some reason, the cache seems to just fail to provide new connections.
This seems to be a case where an abstraction was made to provide an optimization, but the abstraction is completely hiding the underlying root cause, and I am in the dark of what is going on. Being open source, I am trying to provide as much detail as I can, but really, I do not have much to go other than A task was canceled.
.
I do not think see how this is an issue with Tweetinvi
, because looking at their source, it is a pretty simple usage of HttpClient
, and the library generally works with Ubuntu 20.04. And I do not think I can file a bug report with Ubuntu (or any other open source libraries Ubuntu uses), since the only thing I have to go on is A task was canceled.
To be honest, I am not sure if this is a problem with .NET 6, but .NET 6 is my current blocker because I do not have any additional context of what the issue is.
Dod you get packet captures @Nican. There are cases when upstream firewall may cut out connection while HttpClient would see it alive. You can try to tune down ideal timers or enable TCP keep-alive. Ubuntu 22 also has OpenSSL3. While that should be supported this is whole new beast and there may be some surprises.
Dod you get packet captures @Nican. There are cases when upstream firewall may cut out connection while HttpClient would see it alive. You can try to tune down ideal timers or enable TCP keep-alive. Ubuntu 22 also has OpenSSL3. While that should be supported this is whole new beast and there may be some surprises.
Thank you for your time. I tried to do some packet capture with tcpdump
and wireshark
, but I am not very skilled on that front. It was also hard to distinguish what was the problem, since it only happened during high loads. I could provide you with my traces, if you wish.
I do not think it was any firewall issues, since this was just two fresh installs of Ubuntu on VMs on the same hypervisor on a home network.
In our case, we're using a single HttpClient connecting to Amazon S3 with a default 60-second timeout. We're running inside of AWS on ARM64 and use a standard VPC configuration.
We're serving 100-200 requests a second under load (with bursts of up-to 1,000), and were randomly seeing clusters of 100+ requests timeouts in our error logs since updating to Ubuntu 22.04. We reverted back to 20.04 for now after we were unable to isolate a root cause.
It also looks like something deeper in the internals causing the timeout. We also suspect this may be related to OpenSSL 3.0, as we have not seen these errors on another high load internal service that communicates over HTTP.
Still pondering on ideas to isolate further...
Our stack trace:
TaskCanceledException | The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing.
at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
at Amazon.S3.S3Client.SendS3RequestAsync(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationToken cancellationToken)
Our HttpClient configuration:
private static readonly HttpClient s3HttpClient = new (new SocketsHttpHandler {
PooledConnectionIdleTimeout = TimeSpan.FromSeconds(15), // S3 has a idle timeout of 20 seconds after which it closes the connection if the client is not sending data
PooledConnectionLifetime = TimeSpan.FromMinutes(10),
ConnectTimeout = TimeSpan.FromSeconds(5),
MaxConnectionsPerServer = 500,
UseCookies = false
}) {
Timeout = TimeSpan.FromSeconds(60)
};
@iamcarbon The HttpClient.HandleFailure
looks like the re-throw exception. Do you have details on the InnerException
?
I wonder if it is also failing on GetHttp11ConnectionAsync
(or something similar).
This is a beautiful and I think it points to the fact that we need better tools to tell you where time is being spent. We should be able to collect network events and dump something like this:
cc @MihaZupan
@davidfowl Thank you for for the nice words. :)
Not to get too off topic, but while we are here and talking about better tracing, may I suggest you look into OpenTelemetry and JaegerTracing.
They are already beautiful open-source momentum in providing better internal analytics of software more than flat logs. For example, CockroachDB is a SQL database that I use, and it allows you to deep dive into the execution of a SQL query (docs). In the same manner, it would be interesting to deep dive into the execution of an HttpClient
.
AppInsights (Azure's analytics tool) also have some integration with OpenTelemetry (docs). As well as ElasticSearch (docs) (blog). And Microsoft already has some maintainers on OpenTelemetry .NET.
Once again, thank you for looking into this issue.
Thanks I'm extremely familiar with OT, Jaegar, zipkin and other tools that show timeline views for distributed traces. They don't tell you information you asked for though.
Are you able to collect a trace with the following providers enabled?
--providers System.Net.Http,System.Net.NameResolution,System.Net.Sockets,System.Net.Security,Private.InternalDiagnostics.System.Net.Http,Private.InternalDiagnostics.System.Net.Sockets,Private.InternalDiagnostics.System.Net.Security,System.Threading.Tasks.TplEventSource:0x80
This can help us determine what an individual request was doing and spending time on.
Since the stack trace is reaching WaitWithCancellationAsync, I would assume that the code is passing through No available HTTP/1.1 connections; request queued. log message on line #556. What queue is this? How do I control this?
If there are no available connections when you start a request, HttpClient
has to go and initiate more. During this time, the request sits in an internal queue and waits for a connection to be available.
This is so that the request can be processed as soon as any connection is available, and it doesn't have to wait for the specific connection it initiated. This is a change that was made in .NET 6.
An unfortunate issue is that if your request timeout kicks in before a connection attempt actually fails, the request won't see the underlying failure.
You can try increasing the HttpClient.Timeout
and see if requests start failing with connection errors instead of timeouts. The traces mentioned above will also contain this info.
I do not think see how this is an issue with Tweetinvi, because looking at their source, it is a pretty simple usage of HttpClient, and the library generally works with Ubuntu 20.04.
There may very well be an issue in .NET, but it's still worth opening an issue with the library about the fact that HttpClient
instances aren't being reused. Even without the Ubuntu 22 issues, it means the library will have sub-optimal performance.
Once again, thank you all for your time.
@MihaZupan I am currently on a trip. I will get you the traces next week.
I already filed a bug report with Tweetinvi (https://github.com/linvi/tweetinvi/issues/1160) last week.
This issue has been marked needs-author-action
and may be missing some important information.
Apologies for the delay; I only got home from my trip last night.
I attempted to repro this morning. A clean install of Ubuntu 22.04, with .NET 6, on the same hypervisor. To my satisfaction (and dismay) I can not repro the issue.
Thank you so much for the help so far. Open source is awesome, and I hope more meaningful error messages are added to track down this particular edge case in the future.
@iamcarbon Please confirm that you are also not having the same issue.
Description
I am failing to create a lot of concurrent connections. I have this program trying to create hundreds of concurrent connections to Twitter, and it seems to be throttled by some internal throttling mechanism that I am not sure how to control.
The stack trace:
Here is the source code for Tweetinvi, and it seems to be making a new
HttpClient
per request. (Which is inefficient, but does not seem to be the problem)Taking a look at GetHttp11ConnectionAsync, it seems to be stuck on a
_http11RequestQueue
. But I can not seem to understand how to control this queue.Some additional details of why the request has timed out would be nice.
Configuration
.NET 6 on Ubuntu 22.04 x64, compiled to a native binary.
Regression?
Data
Analysis
Since the stack trace is reaching
WaitWithCancellationAsync
, I would assume that the code is passing throughNo available HTTP/1.1 connections; request queued.
log message on line #556. What queue is this? How do I control this?I have also added
To the first line after
Main
. Andto the
HttpClient
. (After having to recompile Tweetinvi)