adamsitnik opened this issue 4 years ago (status: Open)
Tagging subscribers to this area: @eiriktsarpalis Notify danmosemsft if you want to be subscribed.
> Perhaps we should give a single-producer-multiple-consumer concurrent queue a try
You could experiment quickly with that by using https://github.com/dotnet/runtime/blob/master/src/libraries/Common/src/System/Collections/Concurrent/SingleProducerConsumerQueue.cs and guarding access to TryDequeues with a SpinLock or Monitor. It'd require more investigation to find or design and implement a new SPMC queue.
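To make the shape of that experiment concrete, here is a minimal sketch (assuming the linked internal `SingleProducerConsumerQueue<T>` file is copied into the project, since it is not public API, and using a plain Monitor rather than a SpinLock):

```csharp
// Sketch only: SingleProducerConsumerQueue<T> is an internal BCL type, so this
// assumes the file linked above has been copied into the project (adjust the
// type name if it differs in that file).
internal sealed class SingleProducerMultiConsumerQueue<T>
{
    private readonly SingleProducerConsumerQueue<T> _queue = new SingleProducerConsumerQueue<T>();
    private readonly object _dequeueLock = new object();

    // The underlying queue's contract: only one thread may enqueue.
    public void Enqueue(T item) => _queue.Enqueue(item);

    // Multiple consumers are serialized here, because the underlying queue
    // only supports a single concurrent dequeuer.
    public bool TryDequeue(out T item)
    {
        lock (_dequeueLock)
        {
            return _queue.TryDequeue(out item);
        }
    }
}
```

A SpinLock could be swapped in for the Monitor if the dequeue critical section turns out to be short enough; either way this only approximates a real SPMC design, which is why a dedicated queue would need more investigation.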
Plaintext platform also dropped; latency is much lower, but CPU is ~50% on AMD
Also dropped a bit on the smaller Intel
Which are the callers of ConcurrentQueue in the inclusive stacks? As it's used in several places (including the ThreadPool and IOQueue).
> Which are the callers of ConcurrentQueue in the inclusive stacks? As it's used in several places (including the ThreadPool and IOQueue).
That's a good question; I just assumed Adam was referring to the queue added as part of the previous change, and that's what my comments were in reference to. But you're right that these could be coming from multiple other places.
You may be seeing the effect of AMD being more NUMA than Intel. If that is the case, we should see similar results on ARM64, and we are likely to see a similar trend on Intel as it becomes more NUMA by adding more cores. NUMA is a consequence of the laws of physics: it just does not work well for 100+ cores to all go through the same memory pipe.
We may need to make the thread pool and the socket engines NUMA aware to address this scalability bottleneck. The other option is to tell people to run one instance per NUMA node for best performance. (Running one instance per NUMA node is likely to provide better scalability in most cases even if we do the NUMA optimizations.)
NUMA is configurable on AMD machines. To get more data, you can experiment with different NUMA configurations to see how they affect the results.
Some more info: https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf
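As a rough illustration of the "one instance per NUMA node" option above: on Linux the proper tool is something like numactl (which also binds memory), but a hedged sketch of the CPU-affinity half in managed code, with the core range being a pure assumption about the machine's topology, could be:

```csharp
using System;
using System.Diagnostics;

class PinToNode
{
    static void Main()
    {
        // Crude sketch: pin this instance to the first 16 logical CPUs,
        // assuming they all belong to the same NUMA node (machine-specific!).
        // A second instance would use the next 16 CPUs, and so on.
        // Note: ProcessorAffinity only covers up to 64 logical CPUs and does
        // not bind memory allocations to the node, so it is only half of the
        // story compared to running under numactl.
        long mask = (1L << 16) - 1;
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)mask;

        Console.WriteLine($"Pinned to CPUs with mask 0x{mask:X}");
    }
}
```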
> Which are the callers of ConcurrentQueue in the inclusive stacks? As it's used in several places
That's a very good question.
I am going to test it with a different number of IOQueue threads, which is currently limited to 16.
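For reference, in a plain Kestrel application the IOQueue count is exposed through the socket transport options; the TechEmpower platform benchmarks wire this up differently, so the sketch below is only illustrative of the knob being changed to ProcessorCount / 2:

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                // Override the default IOQueue count (which is capped at 16)
                // with ProcessorCount / 2 for this experiment.
                webBuilder.UseSockets(options =>
                    options.IOQueueCount = Environment.ProcessorCount / 2);

                webBuilder.Configure(app =>
                    app.Run(ctx => ctx.Response.WriteAsync("Hello World")));
            })
            .Build()
            .Run();
}
```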
I've finished running the benchmarks with the IOQueue thread count set to Environment.ProcessorCount / 2. Results for JSONPlatform with 512 connections:
| Machine | IOQueue = 16 | IOQueue = cpus / 2 |
|---|---|---|
| Citrine 28 cores | 1,146,236 | 1,126,055 |
| Mono 56 cores | 1,047,780 | 1,225,314 |
| AMD 46 cores | 676,425 | 738,437 |
> 15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:
Maybe also interesting to investigate: 8.6% spent in CLRLifoSemaphore::Wait.
> we never get 100% CPU utilization.
Can we figure out why we go idle?
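One cheap way to start answering that (an illustrative sketch, not something that was actually run here; it assumes the counters are sampled from inside the benchmarked process) is to watch the thread pool counters under load and see whether work is piling up or the pool is genuinely idle:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class PoolMonitor
{
    // Periodically dump thread pool stats while a benchmark is running:
    // a large pending count suggests back-pressure, while low thread and
    // pending counts suggest the pool really is going idle.
    static async Task Main()
    {
        while (true)
        {
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount} " +
                $"pending={ThreadPool.PendingWorkItemCount} " +
                $"completed={ThreadPool.CompletedWorkItemCount}");
            await Task.Delay(1000);
        }
    }
}
```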
> You may be seeing the effect of AMD being more NUMA than Intel.
NUMA makes synchronization (like locks and Interlocked operations) more costly, right?
> NUMA makes synchronization (like locks and Interlocked operations) more costly, right?
Yep, and it is not just synchronization - any memory access is more expensive when different NUMA nodes are accessing the same memory.
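A tiny, hedged illustration of that effect (not a proper benchmark; it does not pin threads to nodes, so it only shows the general cost of many cores hammering one cache line, which gets worse once the traffic crosses NUMA nodes):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ContentionDemo
{
    const long IterationsPerThread = 10_000_000;

    static void Main()
    {
        foreach (int threads in new[] { 1, 2, Environment.ProcessorCount })
        {
            long shared = 0;
            var sw = Stopwatch.StartNew();

            // Every thread increments the same variable, so the cache line
            // holding it keeps bouncing between cores (and, on a multi-node
            // machine, between NUMA nodes).
            var tasks = new Task[threads];
            for (int i = 0; i < threads; i++)
            {
                tasks[i] = Task.Run(() =>
                {
                    for (long n = 0; n < IterationsPerThread; n++)
                        Interlocked.Increment(ref shared);
                });
            }
            Task.WaitAll(tasks);

            sw.Stop();
            Console.WriteLine($"{threads,3} threads: {sw.ElapsedMilliseconds} ms");
        }
    }
}
```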
> any memory access is more expensive when different NUMA nodes are accessing the same memory.
This NUMA data locality / affinity topic could be a dedicated GH issue, since there are a lot of sub-topics and consequences (and there are already some issues about it).
On one side, processors are becoming less and less monolithic on both the AMD and the Intel side (Ryzen inter-CCX latency is twice the intra-CCX latency); on the other side, we are using more and more multi-socket platforms (2, 4, and 8 sockets are the new standard).
https://twitter.com/IanCutress/status/1226519413978992640
https://www.anandtech.com/show/15785/the-intel-comet-lake-review-skylake-we-go-again/4
Our current production servers are actually 2-socket Xeons, and we keep our VMware VMs at a core count below one socket's core count (same for RAM) to avoid the inter-socket latency penalty in non-NUMA-aware workloads (like .NET). The only VMs using all cores (2 sockets) are the SQL Server ones, since SQL Server is NUMA aware.
Based on an offline discussion that contained EPYC numbers that can't be shared publicly, I can just say that #44265 helped with this issue.
@sebastienros please let me know when our high-core count CPU machines are online again ;)
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.
This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.
I'd like @kouvel to decide whether or not to close this, not the bot.
In #35330 and #35800 we changed a crucial part of the Sockets implementation on Linux, which allowed us to get some really nice gains in the vast majority of the benchmarks and hardware configurations.
Today I've looked into the AMD numbers (the AMD machine was offline when we were working on the recent changes), and it looks like overall 5.0 is much faster than 3.1, but the recent changes have slightly decreased performance:
I've quickly profiled it by running the following command:
15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:

This machine has 46 cores. Even if I set the number of epoll threads to the old value, we never get 100% CPU utilization. In fact, even before our recent changes we never did. I've even run the Netty benchmarks and it's also struggling - it's consuming only 39% of CPU.
On an Intel machine with 56 cores we spent a similar amount of time in these two methods:
But despite that, the recent changes have almost doubled the throughput on this machine.
On the 28-core Intel machine it's less than 3% in total:
I believe that this phenomenon requires further investigation.
Perhaps we should give a single-producer-multiple-consumer concurrent queue a try (a suggestion from @stephentoub from a while ago)?
cc @stephentoub @kouvel @tannergooding @tmds @benaadams