Closed MihaZupan closed 6 months ago
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
| Author: | MihaZupan |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Net.Http`, `tenet-performance`, `untriaged` |
| Milestone: | - |
Are you able to run a similar test on previous versions, e.g. .NET Core 3.1 vs .NET 5 vs .NET 6 vs .NET 7? I'm curious if this has changed recently.
Needs tiny changes in the benchmark app, I'll do that
From a quick test it looks like the Threads=1 case is significantly improved with every release whereas contention (% drop depending on number of threads) is very similar.
.NET 5, 6 and 7 compared to (theoretical) perfect scaling
Triage: Contention in HTTP/1.1. There might be similar problems in HTTP/2 and HTTP/3, but they are not measured / tracked here. Worth taking a look, but not high priority, as the benchmark above is synthetic.
As an FYI - numbers for HTTP/2 on .NET 6, scenario is GET with OK (no content) from the server:

| client | 1x256 | 8x32 | |
|---|---|---|---|
| CPU Usage (%) | 69 | 93 | +34.78% |
| Requests | 2,304,474 | 5,281,524 | +129.19% |
| Mean RPS | 153,626 | 352,137 | +129.22% |
```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=256 --variable numberOfHttpClients=1 --json 1x256.json
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=32 --variable numberOfHttpClients=8 --json 8x32.json
```
Scratch that; I forgot the huge H2 improvements in Kestrel. The gap still persists, though. This is with both client and server on a 7.0 nightly build:
| client | 1x256 | 8x32 | |
|---|---|---|---|
| CPU Usage (%) | 70 | 90 | +28.57% |
| Requests | 4,850,055 | 8,609,368 | +77.51% |
| Mean RPS | 323,367 | 574,032 | +77.52% |
```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=256 --variable numberOfHttpClients=1 --server.framework net7.0 --server.channel latest --client.framework net7.0 --client.channel latest --json 1x256.json
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=32 --variable numberOfHttpClients=8 --server.framework net7.0 --server.channel latest --client.framework net7.0 --client.channel latest --json 8x32.json
```
We are seeing this under load as well in our production environments. In one scenario we actually created a "proxy" in Go that batched a bunch of GET requests into a single POST, reducing 25 HTTP connections to 1 from the .NET perspective, and let Go deal with the heavy load. It's unfortunate that we had to do that, but it helped. In other scenarios it is more complicated, and we can't go that route.
We have an HTTP Varnish cache, and a decent number of our responses take less than 1 ms, but at times we will see contention turn that into 40+ ms, which kills performance and our ability to handle high loads, as operations take so much longer.
One thing we are thinking about doing is adding multiple "destinations" (i.e. host entries) to maybe alleviate the contention; this reminds me of the old browser days when you could only have so many concurrent requests per domain.
We tried HTTP/2, but that didn't seem to help, and under heavy load we were getting "accessing a disposed object" errors, which forced us back to HTTP/1.1.
Things we've noticed:

- DNS contention - Our thought is that the framework should allow us to dictate DNS behavior. These DNS entries never change, as they are services in the K8s cluster; the mapping rarely changes, and if it did, all the pods would have been destroyed beforehand anyway.
- SSL contention - We switched to plain HTTP, as again, these are internal calls within the cluster.

Another thing we've thought about is moving to UDS and proxying through Istio/Envoy.
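For what it's worth, `SocketsHttpHandler.ConnectCallback` (available since .NET 5) can be used to take name resolution into your own hands. A minimal sketch, assuming the address is resolved once at startup (loopback is used here only to keep the sketch self-contained; in a cluster you would resolve the service name once with `Dns.GetHostAddressesAsync`):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;

// Sketch: pin the destination address so every new pooled connection skips DNS.
// cachedAddress would normally come from a one-time Dns.GetHostAddressesAsync
// call at startup; IPAddress.Loopback is a placeholder for this sketch.
IPAddress cachedAddress = IPAddress.Loopback;

var handler = new SocketsHttpHandler
{
    ConnectCallback = async (context, cancellationToken) =>
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            // Dial the cached address, keeping the port the request asked for.
            await socket.ConnectAsync(
                new IPEndPoint(cachedAddress, context.DnsEndPoint.Port), cancellationToken);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};

var client = new HttpClient(handler);
```

This doesn't remove pool-lock contention, but it does take per-connection DNS cost and DNS-level contention out of the picture.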
Thoughts on a fix and thoughts on multiple DNS entries?
Playing around with a (lightly tested) reimplementation of the HTTP/1.1 pool that boils down to `ConcurrentQueue.Enqueue` + `ConcurrentQueue.TryDequeue` on the happy path: https://github.com/MihaZupan/runtime/commit/26432354838eab4cda8681e3269138f1e92f470c.
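The shape of that happy path can be illustrated as follows (illustrative only, not the runtime's actual internals): renting and returning an idle connection becomes a single lock-free dequeue/enqueue instead of taking a lock around a list of idle connections.

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for the pool's idle-connection list; strings stand in for connections.
var idle = new ConcurrentQueue<string>();

// Request completes: return the connection to the pool (one Enqueue, no lock).
idle.Enqueue("connection-1");

// Next request starts: try to grab an idle connection (one TryDequeue, no lock).
if (idle.TryDequeue(out var connection))
{
    Console.WriteLine($"reusing {connection}");
}
else
{
    Console.WriteLine("pool empty, establish a new connection");
}
```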
```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable concurrencyPerHttpClient=256 --variable numberOfHttpClients=1 --server.framework net9.0 --client.framework net9.0 --json 1x256.json
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable concurrencyPerHttpClient=32 --variable numberOfHttpClients=8 --server.framework net9.0 --client.framework net9.0 --json 8x32.json
crank compare .\1x256.json .\8x32.json
```
| client | 1x256 | 8x32 | |
|---|---|---|---|
| RPS | 693,873 | 875,814 | +26.22% |
| Patched RPS | 873,571 | 876,394 | +0.32% |
Or YARP's http-http 100 byte scenario
| load | yarp-base | yarp-patched | |
|---|---|---|---|
| Latency 50th (ms) | 0.73 | 0.68 | -6.97% |
| Latency 75th (ms) | 0.82 | 0.74 | -9.82% |
| Latency 90th (ms) | 1.03 | 0.89 | -13.39% |
| Latency 95th (ms) | 1.41 | 1.18 | -16.41% |
| Latency 99th (ms) | 2.87 | 2.68 | -6.63% |
| Mean latency (ms) | 0.83 | 0.76 | -8.74% |
| Requests/sec | 306,699 | 335,921 | +9.53% |
On an in-memory loopback benchmark that stresses the connection pool contention: https://gist.github.com/MihaZupan/27f01d78c71da7b9024b321e743e3d88
Rough RPS numbers with 1-6 threads:
| RPS (1000s) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| main | 2060 | 1900 | 1760 | 1670 | 1570 | 1500 |
| patched | 2150 | 2600 | 3400 | 3700 | 4100 | 4260 |
@stephentoub @MihaZupan Is there any chance that this fix might get cherry-picked back to .NET 8? /CC @aceven24-csgp
Why do you ask? Are you seeing such issues with your service?
It's unlikely that we would backport this change. This isn't a correctness fix - the behavior is the same. It's a performance improvement that only matters at very high request throughput that won't be reached by most applications.
@MihaZupan yes, we've been seeing these issues since the .NET 6 timeframe (see the comment added by @aceven24-csgp above). Our application has very high request throughput. Our company typically only aligns with LTS releases, hence the ask about backporting to .NET 8, rather than waiting for what would be .NET 10 for us.
Have you tried manually splitting requests over multiple `SocketsHttpHandler` instances? E.g. use 8 instances instead of 1 and randomly switch between them? That would achieve a similar thing as this change at high load. And if it doesn't, you may be running into different issues.
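A minimal sketch of that workaround (the shard count of 8 is arbitrary): each `SocketsHttpHandler` gets its own connection pool, so each pool's locks are contended by only a fraction of the in-flight requests.

```csharp
using System;
using System.Net.Http;
using System.Threading;

// Create several independent handlers, each with its own connection pool.
var invokers = new HttpMessageInvoker[8];
for (int i = 0; i < invokers.Length; i++)
{
    invokers[i] = new HttpMessageInvoker(new SocketsHttpHandler());
}

int next = -1;

// Round-robin shard selection; Interlocked keeps the counter safe across threads.
HttpMessageInvoker NextInvoker() =>
    invokers[(uint)Interlocked.Increment(ref next) % (uint)invokers.Length];

// Each request would then go through NextInvoker().SendAsync(request, cancellationToken).
```

Random selection works just as well as round-robin here; the point is only to spread requests across pools.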
Yes, we have tried something similar to what you suggested and saw improvements, so we do have a reasonable workaround for now /cc @kevinstapleton
Hi @MihaZupan,

This is somewhat anecdotal now, as we implemented the workaround over a year ago, but as @ozziepeeps mentioned, we were seeing heavy contention on the locks within the HTTP connection pool, which appeared to be limiting our throughput, primarily when creating new connections. To be specific, in our trace captures we would see many "RequestLeftQueue" events with high queue times. As a PoC to prove out whether this was affecting us or not, we created a custom handler which had a "pool" of `SocketsHttpHandler` instances implemented via a `ConcurrentQueue`. Using this custom handler practically eliminated the observed contention and removed our throughput bottleneck.

So while we don't need this to be back-ported, we are indeed eager to have a proper fix in the runtime for this. If anything, I think our success with using `ConcurrentQueue` echoes the success you've seen with `ConcurrentStack`.
> we would see many "RequestLeftQueue" events with high queue times

Seeing this during steady state when using HTTP/1.1 indicates that you didn't have enough connections to handle all requests. Are you using `MaxConnectionsPerServer`?
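For context, a sketch of the relevant knob (512 and the idle timeout are arbitrary example values, not recommendations): `SocketsHttpHandler.MaxConnectionsPerServer` defaults to `int.MaxValue` for HTTP/1.1, so sustained "RequestLeftQueue" events usually mean the limit was lowered somewhere, or connection establishment can't keep up.

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Default is int.MaxValue; an explicit cap like this makes requests queue
    // once all 512 connections are busy.
    MaxConnectionsPerServer = 512,
    // Keep warm connections around longer so bursts don't pay reconnect costs.
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2),
};
var client = new HttpClient(handler);
```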
The cost of contention around starting new connections should be orders of magnitude lower than that of establishing new connections themselves. I'd be surprised if the pool contention would be the primary issue there.
The changes I made here are really focused on the "happy path" where we're able to open enough connections to handle the load so eventually we're just processing requests without having to open any new connections. If you were limited by the number of connections, the recent changes wouldn't help with contention.
When multiple requests use the same connection pool (same handler + same destination), they will hit a bunch of contention when acquiring and releasing the connection from the pool.
A quick example comparing the HttpClient benchmark using 1 vs 8 handlers, both with 256 worker tasks, shows a ~13% difference.
I saw similar numbers (a ~15% E2E difference) in my LLRP experiment when switching to using multiple handlers.
Approximate numbers from a benchmark doing minimal work outside of spamming the connection pool:
On my machine (8-core CPU), the Threads=6 scenario is spending 57% of CPU on getting the lock in `GetHttp11ConnectionAsync` and `ReturnHttp11Connection`. We should look into reducing the contention here if possible.