Licenser opened this issue 4 years ago
@Licenser something to consider though is that while NUMA architectures have seen increasing uptake for a while, more recent AMD CPUs such as the Ryzen 3000 series operate in a single NUMA domain thanks to the improved Infinity Fabric.
I don't know how much of a trend this is / how long it'll be before this is common. But it's probably interesting to know that this is an issue that might solve itself as time goes on, simply because at the upper end of the market the architectures are shifting.
They do operate in a single NUMA domain by default, which is fine for many applications, but internally they are still 4 NUMA domains, and you can change the number of exposed NUMA domains. Even then the 4 domains are not entirely 'honest', as there isn't always a single shared cache within a domain. The impact here will mostly be felt by workloads that fit into cache - given that we are at 100+ MB of cache by now, I can see those becoming more frequent.
So basically, while the Ryzen 3000 pretends to be a single NUMA node, and the difference when accessing memory is minimal, it is still 4 or 8 domains internally. I suspect (and this is speculation) that for some workloads, especially where performance is concerned, the mode is switched back to exposing all NUMA domains.
On the second bit, yes, your guess is as good as mine; then again, given the current success I think the architecture is going to stay and evolve for a while.
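As a side note, here is a minimal, Linux-only sketch for checking how many NUMA domains a machine actually exposes and which CPUs belong to each, by reading sysfs (`/sys/devices/system/node/node*/cpulist`); handy for seeing whether a Ryzen box is configured as one domain or several. The sysfs paths are standard Linux, the rest of the program is just illustrative:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Each exposed NUMA node shows up as /sys/devices/system/node/node<N>.
    for entry in fs::read_dir("/sys/devices/system/node")? {
        let entry = entry?;
        let name = entry.file_name().into_string().unwrap_or_default();
        // Skip non-node entries like "online", "possible", "has_cpu", ...
        if name.len() > 4
            && name.starts_with("node")
            && name[4..].chars().all(|c| c.is_ascii_digit())
        {
            let cpulist = fs::read_to_string(entry.path().join("cpulist"))?;
            println!("{}: CPUs {}", name, cpulist.trim());
        }
    }
    Ok(())
}
```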
Hello there! We have a solution for this case in Bastion: migrations are not allowed, and we are using a NUMA-aware allocator as an unstable feature. A NUMA discussion was also going on in wg-allocators, and I made a comment about this there too.
That being said, in systems doing high-throughput multicore processing over differently cached data (even with message passing, which is what most Rust projects use), this enables faster processing.
We both pin at the OS level and allocate carefully when the unstable API is exposed. That being said, the combination of MIMD on Intel Xeon Phi and NUMA-aware execution like ours will be a monster devouring your data.
Zen and Zen 2 are both built from one or more CCXs with 4 cores each (some have cores disabled so that imperfect but still usable chips aren't wasted). L3 is shared within a single CCX, but not (or only with significant Infinity Fabric latency) between multiple CCXs. Intel iirc has L2 shared between core pairs, and L3 shared per chip.
While latency is indeed tolerable on single-die Zen 2 systems, unfair cacheline bouncing is still a big issue in cmp_exchange-based algorithms. Using non-NUMA-aware algorithms that rely on cmp_exchange in contended scenarios will cause problems.
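To make that concrete, here is a small sketch of the kind of contended CAS loop meant here (plain std, nothing project-specific). Running it with all threads confined to one CCX (e.g. `taskset -c 0-3`) versus spread across CCXs or sockets makes the cacheline bouncing visible in the timings:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

const THREADS: usize = 8;
const OPS_PER_THREAD: u64 = 1_000_000;

fn main() {
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();

    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..OPS_PER_THREAD {
                    // Classic CAS loop: every retry pulls the cacheline into
                    // this core in exclusive state, so under contention the
                    // line bounces between cores/CCXs.
                    let mut cur = counter.load(Ordering::Relaxed);
                    loop {
                        match counter.compare_exchange_weak(
                            cur,
                            cur + 1,
                            Ordering::AcqRel,
                            Ordering::Relaxed,
                        ) {
                            Ok(_) => break,
                            Err(actual) => cur = actual,
                        }
                    }
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!(
        "{} CAS increments in {:?}",
        counter.load(Ordering::Relaxed),
        start.elapsed()
    );
}
```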
I think keeping tasks on the same core could be extremely important.
I wrote a toy program that connects 500 threads with pipes, and then drops a byte in one end and measures how long it takes to appear at the other end. So it's measuring single-byte reads and writes on a pipe, and context switches. A friend ran it on their 40-core Intel machine, and pinning it to a single core made it run ten times faster:
Linux 4.15.0-91-generic
40-way machine (2 sockets * 10 cores * 2 hyperthreads, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz)
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29 <-- the "1 socket" pinned to below
NUMA node1 CPU(s): 10-19,30-39
thread: 19.09user 151.84system 2:41.81elapsed 105%CPU (0avgtext+0avgdata 44068maxresident)k
pinned to 1 socket, thread: 16.07user 111.84system 1:52.92elapsed 113%CPU (0avgtext+0avgdata 43408maxresident)k
pinned to 1 core, thread: 5.73user 28.37system 0:26.16elapsed 130%CPU (0avgtext+0avgdata 43868maxresident)k
pinned to 1 vCPU, thread: 3.38user 13.09system 0:16.58elapsed 99%CPU (0avgtext+0avgdata 44040maxresident)k
async: 12.06user 10.69system 0:22.78elapsed 99%CPU (0avgtext+0avgdata 43720maxresident)k
pinned async: 11.48user 11.09system 0:22.59elapsed 99%CPU (0avgtext+0avgdata 43484maxresident)k
Granted, the Xeon E5-2680 came out in 2012...
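For reference, a rough sketch of the kind of relay program described above. It uses socket pairs (`std::os::unix::net::UnixStream::pair`) as a stand-in for raw pipes since std has no portable pipe API, and it times a single pass where the real program presumably loops the measurement many times; pinning is then applied externally, e.g. `taskset -c 0 ./relay` for the single-core case:

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;
use std::time::Instant;

const THREADS: usize = 500;

fn main() -> std::io::Result<()> {
    // THREADS + 1 socket pairs form a chain: worker i reads one byte from
    // pair i and forwards it to pair i + 1.
    let mut write_ends = Vec::with_capacity(THREADS + 1);
    let mut read_ends = Vec::with_capacity(THREADS + 1);
    for _ in 0..=THREADS {
        let (w, r) = UnixStream::pair()?;
        write_ends.push(w);
        read_ends.push(r);
    }

    // Main keeps the write end of the first pair and the read end of the last.
    let mut first_tx = write_ends.remove(0);
    let mut last_rx = read_ends.pop().unwrap();

    let handles: Vec<_> = read_ends
        .into_iter()
        .zip(write_ends)
        .map(|(mut rx, mut tx)| {
            thread::spawn(move || {
                let mut buf = [0u8; 1];
                rx.read_exact(&mut buf).unwrap();
                tx.write_all(&buf).unwrap();
            })
        })
        .collect();

    // Time one byte travelling through all the workers: one 1-byte read,
    // one 1-byte write, and at least one context switch per hop.
    let start = Instant::now();
    first_tx.write_all(&[42])?;
    let mut buf = [0u8; 1];
    last_rx.read_exact(&mut buf)?;
    println!("relay through {} threads took {:?}", THREADS, start.elapsed());

    for h in handles {
        h.join().unwrap();
    }
    Ok(())
}
```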
In case it helps, I've gathered in-depth research on NUMA-aware approaches to task scheduling; see https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md (and search for NUMA, as the list is long).
In particular, highlights from my own NUMA issue:
PhD Thesis on NUMA aware scheduling: https://pdfs.semanticscholar.org/a0ab/00a23377f333ca4c34dac2b74abc5af6ca25.pdf
Nabbit-C, extends Nabbit (Cilk/CilkPlus based task dependency) with locality information: https://www.cse.wustl.edu/~kunal/resources/Papers/nabbit-c.pdf
I think keeping tasks on the same core could be extremely important.
You lose load balancing. Tasks can be pinned to a NUMA domain instead (though I don't think pinning is supported on macOS)
You lose load balancing.
Oh, true! Is it better to say, you want as little migration as possible while still keeping cores busy?
Yes, basically the NUMA issue is one of locality: memory should be allocated on the node whose cores will use it, and tasks should be scheduled close to the memory and caches they touch. So to handle all NUMA problems you need to handle both memory (allocation) and scheduling.
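For the "pin to the NUMA domain" option mentioned above, a minimal Linux-only sketch using `libc::sched_setaffinity` directly (the `libc` crate is assumed; crates like hwloc or core_affinity would be more portable). The CPU list here just mirrors the node0 example from the lscpu output earlier and is machine-specific:

```rust
use libc::{cpu_set_t, sched_setaffinity, CPU_SET, CPU_ZERO};
use std::mem;

// Restrict the calling thread to the given set of CPUs.
fn pin_current_thread_to(cpus: &[usize]) -> std::io::Result<()> {
    unsafe {
        let mut set: cpu_set_t = mem::zeroed();
        CPU_ZERO(&mut set);
        for &cpu in cpus {
            CPU_SET(cpu, &mut set);
        }
        // pid 0 means "the calling thread".
        if sched_setaffinity(0, mem::size_of::<cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // CPUs of NUMA node 0 from the example above (0-9,20-29).
    let node0: Vec<usize> = (0..10).chain(20..30).collect();
    pin_current_thread_to(&node0)?;
    println!("pinned to {:?}", node0);
    Ok(())
}
```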
First of all, this isn't a solution or a request for a specific feature; rather, I'd like to kick off a discussion in the hope that it has some fruitful results.
Disclaimer
I did dig into this topic a bit after experiencing very odd performance on a Ryzen 3000 system, and I am somewhat fascinated by how the chiplet architecture affects performance. The observations I made had to do with how cores communicate, but they go hand in hand with how NUMA or multi-CPU systems should behave - though I do not have a multi-CPU system at hand to verify this.
What
Since the task scheduler might spawn and schedule tasks on different threads, it will be affected by NUMA architectures. It would be interesting, and likely beneficial to performance, to explore how the scheduler can take architectures like this into account.
The biggest impact I could see is in moving tasks from one core to another, and in communication between tasks on different cores.
Why
Non-single-die systems are becoming more prominent: Ryzen multi-die CPUs, which to a degree are NUMA systems, have started to appear in commodity hardware. These considerations will probably apply to multi-CPU systems as well.
Aside from memory access, which might not affect tasks as much, the impact of not having a shared cache can be huge. I've seen performance differences in excess of 2x (or half, depending on how you look at it) between setups that share a cache and setups that don't.
For numbers: pinning threads to cores that share a cache moved throughput in one benchmark from 135 MB/s to over 400 MB/s.
Taking cache invalidation into consideration for communication between cores that don't share a cache makes crossbeam channels up to 2x faster in some scenarios (https://github.com/crossbeam-rs/crossbeam/pull/462).
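A rough way to reproduce the shared-cache effect on your own machine: push messages through a bounded crossbeam channel between two threads and compare throughput under `taskset` with two cores that share an L3 versus two that don't. The core numbers, channel capacity, and message count are illustrative, and crossbeam-channel is assumed as a dependency:

```rust
use crossbeam_channel::bounded;
use std::thread;
use std::time::Instant;

const MESSAGES: u64 = 5_000_000;

fn main() {
    let (tx, rx) = bounded::<u64>(1024);

    // Consumer: drain the channel until the sender is dropped.
    let consumer = thread::spawn(move || {
        let mut sum = 0u64;
        while let Ok(v) = rx.recv() {
            sum = sum.wrapping_add(v);
        }
        sum
    });

    let start = Instant::now();
    for i in 0..MESSAGES {
        tx.send(i).unwrap();
    }
    drop(tx); // close the channel so the consumer's recv() loop ends
    let sum = consumer.join().unwrap();
    let elapsed = start.elapsed();

    println!(
        "{} messages in {:?} ({:.1} M msg/s), checksum {}",
        MESSAGES,
        elapsed,
        MESSAGES as f64 / elapsed.as_secs_f64() / 1e6,
        sum
    );
}
```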
Just a few thoughts :)