Closed MihaZupan closed 6 months ago
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
| Author: | MihaZupan |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Net.Http`, `tenet-performance`, `untriaged` |
| Milestone: | - |
Are you able to run a similar test on previous versions, e.g. .NET Core 3.1 vs .NET 5 vs .NET 6 vs .NET 7? I'm curious if this has changed recently.
Needs tiny changes in the benchmark app, I'll do that
From a quick test it looks like the Threads=1 case is significantly improved with every release whereas contention (% drop depending on number of threads) is very similar.
.NET 5, 6 and 7 compared to (theoretical) perfect scaling
Triage: Contention in HTTP/1.1. There might be similar problems in HTTP/2 and HTTP/3, but they are not measured / tracked here. Worth taking a look, but not high priority, as the benchmark above is synthetic.
As an FYI - numbers for HTTP/2 on .NET 6, scenario is GET with OK (no content) from the server:

| client | 1x256 | 8x32 | |
|---|---|---|---|
| CPU Usage (%) | 69 | 93 | +34.78% |
| Requests | 2,304,474 | 5,281,524 | +129.19% |
| Mean RPS | 153,626 | 352,137 | +129.22% |
```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=256 --variable numberOfHttpClients=1 --json 1x256.json
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=32 --variable numberOfHttpClients=8 --json 8x32.json
```
Scratch that; I forgot the huge H2 improvements in Kestrel. The gap still persists, though. This is with both client and server on a 7.0 nightly build:
| client | 1x256 | 8x32 | |
|---|---|---|---|
| CPU Usage (%) | 70 | 90 | +28.57% |
| Requests | 4,850,055 | 8,609,368 | +77.51% |
| Mean RPS | 323,367 | 574,032 | +77.52% |
```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=256 --variable numberOfHttpClients=1 --server.framework net7.0 --server.channel latest --client.framework net7.0 --client.channel latest --json 1x256.json
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable useHttps=true --variable httpVersion="2.0" --variable concurrencyPerHttpClient=32 --variable numberOfHttpClients=8 --server.framework net7.0 --server.channel latest --client.framework net7.0 --client.channel latest --json 8x32.json
```
We are seeing this under load as well in our production environments. In one scenario we actually created a "proxy" in Go that batched a bunch of GET requests into a single POST, reducing 25 HTTP connections to 1 from the .NET perspective, and let Go deal with the heavy load. It's unfortunate that we had to do that, but it helped. In other scenarios it is more complicated, and we can't go that route.
We have an HTTP Varnish cache, and a decent number of our responses take less than 1 ms, but at times we will see contention turn that into 40+ ms, which kills performance and our ability to handle high loads, as operations take so much longer.
One thing we are thinking about doing is adding multiple "destinations" (i.e. host entries) to maybe alleviate the contention; this reminds me of the old browser days when you could only have so many concurrent requests per domain.
We tried HTTP/2, but that didn't seem to help, and under heavy load we were getting "accessing a disposed object" errors, which forced us back to HTTP/1.1.
Things we've noticed:

- DNS contention - Our thought is that the framework should allow us to dictate DNS behavior. These DNS entries never change, as they are services in the K8s cluster; the mapping rarely changes, and if it did, all the pods would have been destroyed beforehand anyway.
- SSL contention - We switched to plain HTTP, as again, these are internal calls within the cluster.

Another thing we've thought about is moving to UDS and proxying through Istio/Envoy.
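For what it's worth, `SocketsHttpHandler.ConnectCallback` (available since .NET 5) can be used to take name resolution into your own hands. A minimal sketch, assuming the address is resolved once at startup (loopback is used here only to keep the sketch self-contained; in a cluster you would resolve the service name once with `Dns.GetHostAddressesAsync`):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;

// Sketch: pin the destination address so every new pooled connection skips DNS.
// cachedAddress would normally come from a one-time Dns.GetHostAddressesAsync
// call at startup; IPAddress.Loopback is a placeholder for this sketch.
IPAddress cachedAddress = IPAddress.Loopback;

var handler = new SocketsHttpHandler
{
    ConnectCallback = async (context, cancellationToken) =>
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            // Dial the cached address, keeping the port the request asked for.
            await socket.ConnectAsync(
                new IPEndPoint(cachedAddress, context.DnsEndPoint.Port), cancellationToken);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};

var client = new HttpClient(handler);
```

This doesn't remove pool-lock contention, but it does take per-connection DNS cost and DNS-level contention out of the picture.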
Thoughts on a fix and thoughts on multiple DNS entries?
Playing around with a (lightly tested) reimplementation of the HTTP/1.1 pool that boils down to `ConcurrentQueue.Enqueue` + `ConcurrentQueue.TryDequeue` on the happy path: https://github.com/MihaZupan/runtime/commit/26432354838eab4cda8681e3269138f1e92f470c.
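The shape of that happy path can be illustrated as follows (illustrative only, not the runtime's actual internals): renting and returning an idle connection becomes a single lock-free dequeue/enqueue instead of taking a lock around a list of idle connections.

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for the pool's idle-connection list; strings stand in for connections.
var idle = new ConcurrentQueue<string>();

// Request completes: return the connection to the pool (one Enqueue, no lock).
idle.Enqueue("connection-1");

// Next request starts: try to grab an idle connection (one TryDequeue, no lock).
if (idle.TryDequeue(out var connection))
{
    Console.WriteLine($"reusing {connection}");
}
else
{
    Console.WriteLine("pool empty, establish a new connection");
}
```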
```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable concurrencyPerHttpClient=256 --variable numberOfHttpClients=1 --server.framework net9.0 --client.framework net9.0 --json 1x256.json
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/httpclient.benchmarks.yml --scenario httpclient-kestrel-get --profile aspnet-citrine-lin --variable useHttpMessageInvoker=true --variable concurrencyPerHttpClient=32 --variable numberOfHttpClients=8 --server.framework net9.0 --client.framework net9.0 --json 8x32.json
crank compare .\1x256.json .\8x32.json
```
| client | 1x256 | 8x32 | |
|---|---|---|---|
| RPS | 693,873 | 875,814 | +26.22% |
| Patched RPS | 873,571 | 876,394 | +0.32% |
Or YARP's http-http 100 byte scenario
| load | yarp-base | yarp-patched | |
|---|---|---|---|
| Latency 50th (ms) | 0.73 | 0.68 | -6.97% |
| Latency 75th (ms) | 0.82 | 0.74 | -9.82% |
| Latency 90th (ms) | 1.03 | 0.89 | -13.39% |
| Latency 95th (ms) | 1.41 | 1.18 | -16.41% |
| Latency 99th (ms) | 2.87 | 2.68 | -6.63% |
| Mean latency (ms) | 0.83 | 0.76 | -8.74% |
| Requests/sec | 306,699 | 335,921 | +9.53% |
On an in-memory loopback benchmark that stresses the connection pool contention: https://gist.github.com/MihaZupan/27f01d78c71da7b9024b321e743e3d88
Rough RPS numbers with 1-6 threads:
| RPS (1000s) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| main | 2060 | 1900 | 1760 | 1670 | 1570 | 1500 |
| patched | 2150 | 2600 | 3400 | 3700 | 4100 | 4260 |
@stephentoub @MihaZupan Is there any chance that this fix might get cherry-picked back to .NET 8? /CC @aceven24-csgp
Why do you ask? Are you seeing such issues with your service?
It's unlikely that we would backport this change. This isn't a correctness fix - the behavior is the same. It's a performance improvement that only matters at very high request throughput that won't be reached by most applications.
@MihaZupan yes, we've been seeing these issues since the .NET 6 timeframe (see the comment added by @aceven24-csgp above). Our application has very high request throughput. Our company typically only aligns with LTS releases, hence the ask about backporting to .NET 8, rather than waiting for what would be .NET 10 for us.
Have you tried manually splitting requests over multiple `SocketsHttpHandler` instances? E.g. use 8 instances instead of 1 and randomly switch between them? That would achieve a similar thing as this change at high load. And if it doesn't, you may be running into different issues.
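A minimal sketch of that workaround (the shard count of 8 is arbitrary): each `SocketsHttpHandler` gets its own connection pool, so each pool's locks are contended by only a fraction of the in-flight requests.

```csharp
using System;
using System.Net.Http;
using System.Threading;

// Create several independent handlers, each with its own connection pool.
var invokers = new HttpMessageInvoker[8];
for (int i = 0; i < invokers.Length; i++)
{
    invokers[i] = new HttpMessageInvoker(new SocketsHttpHandler());
}

int next = -1;

// Round-robin shard selection; Interlocked keeps the counter safe across threads.
HttpMessageInvoker NextInvoker() =>
    invokers[(uint)Interlocked.Increment(ref next) % (uint)invokers.Length];

// Each request would then go through NextInvoker().SendAsync(request, cancellationToken).
```

Random selection works just as well as round-robin here; the point is only to spread requests across pools.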
Yes, we have tried something similar to what you suggested and saw improvements, so we do have a reasonable workaround for now /cc @kevinstapleton
Hi @MihaZupan,

This is somewhat anecdotal now, as we implemented the workaround over a year ago, but as @ozziepeeps mentioned, we were seeing heavy contention on the locks within the HTTP connection pool, which appeared to be limiting our throughput, primarily when creating new connections. To be specific, in our trace captures we would see many "RequestLeftQueue" events with high queue times. As a PoC to prove out whether this was affecting us or not, we created a custom handler which had a "pool" of `SocketsHttpHandler` instances implemented via a `ConcurrentQueue`. Using this custom handler practically eliminated the observed contention and removed our throughput bottleneck.

So while we don't need this to be back-ported, we are indeed eager to have a proper fix in the runtime for this. If anything, I think our success with using `ConcurrentQueue` echoes the success you've seen with `ConcurrentStack`.
> we would see many "RequestLeftQueue" events with high queue times

Seeing this during steady state when using HTTP/1.1 indicates that you didn't have enough connections to handle all requests. Are you using `MaxConnectionsPerServer`?
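For context, a sketch of the relevant knob (512 and the idle timeout are arbitrary example values, not recommendations): `SocketsHttpHandler.MaxConnectionsPerServer` defaults to `int.MaxValue` for HTTP/1.1, so sustained "RequestLeftQueue" events usually mean the limit was lowered somewhere, or connection establishment can't keep up.

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Default is int.MaxValue; an explicit cap like this makes requests queue
    // once all 512 connections are busy.
    MaxConnectionsPerServer = 512,
    // Keep warm connections around longer so bursts don't pay reconnect costs.
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2),
};
var client = new HttpClient(handler);
```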
The cost of contention around starting new connections should be orders of magnitude lower than that of establishing new connections themselves. I'd be surprised if the pool contention would be the primary issue there.
The changes I made here are really focused on the "happy path" where we're able to open enough connections to handle the load so eventually we're just processing requests without having to open any new connections. If you were limited by the number of connections, the recent changes wouldn't help with contention.
When multiple requests use the same connection pool (same handler + same destination), they will hit a bunch of contention when acquiring and releasing the connection from the pool.
A quick example comparing the HttpClient benchmark using 1 vs 8 handlers, both with 256 worker tasks, shows a ~13% difference.
I saw similar numbers (a ~15% E2E difference) in my LLRP experiment when switching to using multiple handlers.
Approximate numbers from a benchmark doing minimal work outside of spamming the connection pool:
On my machine (8-core CPU), the Threads=6 scenario is spending 57% of CPU on getting the lock in `GetHttp11ConnectionAsync` and `ReturnHttp11Connection`. We should look into reducing the contention here if possible.