adamsitnik opened this issue 4 years ago (status: Open)
Tagging subscribers to this area: @eiriktsarpalis Notify danmosemsft if you want to be subscribed.
> Perhaps we should give a single-producer-multiple-consumer concurrent queue a try
You could experiment quickly with that by using https://github.com/dotnet/runtime/blob/master/src/libraries/Common/src/System/Collections/Concurrent/SingleProducerConsumerQueue.cs and guarding access to TryDequeues with a SpinLock or Monitor. It'd require more investigation to find or design and implement a new SPMC queue.
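To make the shape of that experiment concrete, here is a minimal sketch (assuming the linked internal `SingleProducerConsumerQueue<T>` file is copied into the project, since it is not public API, and using a plain Monitor rather than a SpinLock):

```csharp
// Sketch only: SingleProducerConsumerQueue<T> is an internal BCL type, so this
// assumes the file linked above has been copied into the project (adjust the
// type name if it differs in that file).
internal sealed class SingleProducerMultiConsumerQueue<T>
{
    private readonly SingleProducerConsumerQueue<T> _queue = new SingleProducerConsumerQueue<T>();
    private readonly object _dequeueLock = new object();

    // The underlying queue's contract: only one thread may enqueue.
    public void Enqueue(T item) => _queue.Enqueue(item);

    // Multiple consumers are serialized here, because the underlying queue
    // only supports a single concurrent dequeuer.
    public bool TryDequeue(out T item)
    {
        lock (_dequeueLock)
        {
            return _queue.TryDequeue(out item);
        }
    }
}
```

A SpinLock could be swapped in for the Monitor if the dequeue critical section turns out to be short enough; either way this only approximates a real SPMC design, which is why a dedicated queue would need more investigation.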
Plaintext platform also dropped; latency is much lower, but CPU is ~50% on AMD
Also dropped a bit on the smaller Intel
Which are the callers of ConcurrentQueue in the inclusive stacks? As it's used in several places (including the ThreadPool and IOQueue).
> Which are the callers of ConcurrentQueue in the inclusive stacks? As it's used in several places (including the ThreadPool and IOQueue).
That's a good question; I just assumed Adam was referring to the queue added as part of the previous change, and that's what my comments were in reference to. But you're right that these could be coming from multiple other places.
You may be seeing the effect of AMD being more NUMA than Intel. If that is the case, we should see similar results on ARM64, and we are likely to see a similar trend on Intel as it becomes more NUMA by adding more cores. NUMA is a consequence of the laws of physics: it just does not work well for 100+ cores to all go through the same memory pipe.
We may need to make the thread pool and the socket engines NUMA aware to address this scalability bottleneck. The other option is to tell people to run one instance per NUMA node for best performance. (Running one instance per NUMA node is likely to provide better scalability in most cases even if we do the NUMA optimizations.)
NUMA is configurable on AMD machines. To get more data, you can experiment with different NUMA configurations to see how they affect the results.
Some more info: https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf
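As a rough illustration of the "one instance per NUMA node" option above: on Linux the proper tool is something like numactl (which also binds memory), but a hedged sketch of the CPU-affinity half in managed code, with the core range being a pure assumption about the machine's topology, could be:

```csharp
using System;
using System.Diagnostics;

class PinToNode
{
    static void Main()
    {
        // Crude sketch: pin this instance to the first 16 logical CPUs,
        // assuming they all belong to the same NUMA node (machine-specific!).
        // A second instance would use the next 16 CPUs, and so on.
        // Note: ProcessorAffinity only covers up to 64 logical CPUs and does
        // not bind memory allocations to the node, so it is only half of the
        // story compared to running under numactl.
        long mask = (1L << 16) - 1;
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)mask;

        Console.WriteLine($"Pinned to CPUs with mask 0x{mask:X}");
    }
}
```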
> Which are the callers of ConcurrentQueue in the inclusive stacks? As it's used in several places
That's a very good question.
I am going to test it with a different number of IOQueue threads, which is currently limited to 16.
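For reference, in a plain Kestrel application the IOQueue count is exposed through the socket transport options; the TechEmpower platform benchmarks wire this up differently, so the sketch below is only illustrative of the knob being changed to ProcessorCount / 2:

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                // Override the default IOQueue count (which is capped at 16)
                // with ProcessorCount / 2 for this experiment.
                webBuilder.UseSockets(options =>
                    options.IOQueueCount = Environment.ProcessorCount / 2);

                webBuilder.Configure(app =>
                    app.Run(ctx => ctx.Response.WriteAsync("Hello World")));
            })
            .Build()
            .Run();
}
```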
I've finished running the benchmarks with the IOQueue thread count set to Environment.ProcessorCount / 2. Results for JSONPlatform with 512 connections:
| Machine | IOQueue = 16 | IOQueue = cpus / 2 |
|---|---|---|
| Citrine 28 cores | 1,146,236 | 1,126,055 |
| Mono 56 cores | 1,047,780 | 1,225,314 |
| AMD 46 cores | 676,425 | 738,437 |
> 15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:
Maybe also interesting to investigate: 8.6% spent in CLRLifoSemaphore::Wait.
> we never get 100% CPU utilization.
Can we figure out why we go idle?
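One cheap way to start answering that (an illustrative sketch, not something that was actually run here; it assumes the counters are sampled from inside the benchmarked process) is to watch the thread pool counters under load and see whether work is piling up or the pool is genuinely idle:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class PoolMonitor
{
    // Periodically dump thread pool stats while a benchmark is running:
    // a large pending count suggests back-pressure, while low thread and
    // pending counts suggest the pool really is going idle.
    static async Task Main()
    {
        while (true)
        {
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount} " +
                $"pending={ThreadPool.PendingWorkItemCount} " +
                $"completed={ThreadPool.CompletedWorkItemCount}");
            await Task.Delay(1000);
        }
    }
}
```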
> You may be seeing the effect of AMD being more NUMA than Intel.
NUMA makes synchronization (like locks and Interlocked operations) more costly, right?
> NUMA makes synchronization (like locks and Interlocked operations) more costly, right?
Yep, and it is not just synchronization - any memory access is more expensive when different NUMA nodes are accessing the same memory.
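A tiny, hedged illustration of that effect (not a proper benchmark; it does not pin threads to nodes, so it only shows the general cost of many cores hammering one cache line, which gets worse once the traffic crosses NUMA nodes):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ContentionDemo
{
    const long IterationsPerThread = 10_000_000;

    static void Main()
    {
        foreach (int threads in new[] { 1, 2, Environment.ProcessorCount })
        {
            long shared = 0;
            var sw = Stopwatch.StartNew();

            // Every thread increments the same variable, so the cache line
            // holding it keeps bouncing between cores (and, on a multi-node
            // machine, between NUMA nodes).
            var tasks = new Task[threads];
            for (int i = 0; i < threads; i++)
            {
                tasks[i] = Task.Run(() =>
                {
                    for (long n = 0; n < IterationsPerThread; n++)
                        Interlocked.Increment(ref shared);
                });
            }
            Task.WaitAll(tasks);

            sw.Stop();
            Console.WriteLine($"{threads,3} threads: {sw.ElapsedMilliseconds} ms");
        }
    }
}
```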
> any memory access is more expensive when different NUMA nodes are accessing the same memory.
This NUMA data locality / affinity topic could be a dedicated GH issue, since there are a lot of sub-topics and consequences (and there are already some issues about it).
On one side, processors are becoming less and less monolithic on both the AMD and the Intel side (Ryzen inter-CCX latency is twice the intra-CCX latency); on the other side, we are using more and more multi-socket platforms (2, 4, and 8 sockets are the new standard).
https://twitter.com/IanCutress/status/1226519413978992640
https://www.anandtech.com/show/15785/the-intel-comet-lake-review-skylake-we-go-again/4
Our current production servers are actually 2-socket Xeons, and we keep our VMware VMs at a core count below one socket's core count (same for RAM) to avoid the inter-socket latency penalty in non-NUMA-aware workloads (like .NET). The only VMs using all cores (2 sockets) are the SQL Server ones, since SQL Server is NUMA aware.
Based on an offline discussion that contained EPYC numbers that can't be shared publicly, I can just say that #44265 helped with this issue.
@sebastienros please let me know when our high-core count CPU machines are online again ;)
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.
This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.
I'd like @kouvel to decide whether or not to close this, not the bot.
In #35330 and #35800 we changed a crucial part of the Sockets implementation on Linux, which allowed us to get some really nice gains in the vast majority of the benchmarks and hardware configurations.
Today I've looked into the AMD numbers (the AMD machine was offline when we were working on the recent changes), and it looks like overall 5.0 is much faster than 3.1, but the recent changes have slightly decreased performance:
I've quickly profiled it by running the following command:
15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:

This machine has 46 cores. Even if I set the number of epoll threads to the old value, we never get 100% CPU utilization. In fact, even before our recent changes we never did. I've even run the Netty benchmarks and it's also struggling - it's consuming only 39% of CPU.
On an Intel machine with 56 cores we spent a similar amount of time in these two methods:
But despite that, the recent changes have almost doubled the throughput on this machine.
On the 28-core Intel machine it's less than 3% in total:
I believe that this phenomenon requires further investigation.
Perhaps we should give a single-producer-multiple-consumer concurrent queue a try (a suggestion from @stephentoub from a while ago)?
cc @stephentoub @kouvel @tannergooding @tmds @benaadams