Licenser opened this issue 4 years ago
@Licenser something to consider though is that while NUMA architectures have seen increasing uptake for a while, more recent AMD CPUs such as the Ryzen 3000 series operate in a single NUMA domain thanks to the improved Infinity Fabric.
I don't know how much of a trend this is / how long it'll be before this is common. But it's probably interesting to know that this is an issue that might solve itself as time goes on, simply because at the upper end of the market the architectures are shifting.
They do operate in a single NUMA domain by default, which is fine for many applications, but internally they are still 4 NUMA domains, and you can change the number of exposed NUMA domains. Even then the 4 domains are not entirely 'honest', as there isn't always a single shared cache within a domain. The impact here will mostly be felt by workloads that fit into cache - given that we are at 100+ MB of cache by now, I can see those becoming more frequent.
So basically, while the Ryzen 3000 pretends to be a single NUMA node, and the difference when accessing memory is minimal, it is still 4 or 8 domains internally. I suspect (and this is speculation) that for some workloads, especially where performance is concerned, the mode is switched back to exposing all NUMA domains.
On the second bit, yes, your guess is as good as mine; then again, given the current success I think the architecture is going to stay and evolve for a while.
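As a side note, here is a minimal, Linux-only sketch for checking how many NUMA domains a machine actually exposes and which CPUs belong to each, by reading sysfs (`/sys/devices/system/node/node*/cpulist`); handy for seeing whether a Ryzen box is configured as one domain or several. The sysfs paths are standard Linux, the rest of the program is just illustrative:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Each exposed NUMA node shows up as /sys/devices/system/node/node<N>.
    for entry in fs::read_dir("/sys/devices/system/node")? {
        let entry = entry?;
        let name = entry.file_name().into_string().unwrap_or_default();
        // Skip non-node entries like "online", "possible", "has_cpu", ...
        if name.len() > 4
            && name.starts_with("node")
            && name[4..].chars().all(|c| c.is_ascii_digit())
        {
            let cpulist = fs::read_to_string(entry.path().join("cpulist"))?;
            println!("{}: CPUs {}", name, cpulist.trim());
        }
    }
    Ok(())
}
```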
Hello there! We have a solution for this case in Bastion: migrations are not allowed, and we are using a NUMA-aware allocator as an unstable feature. A NUMA discussion was also going on in wg-allocators, and I made a comment about this there too.
That being said, in systems doing high-throughput multicore processing over differently cached data (even with message passing, which is what most Rust projects use), this enables faster processing.
We both pin at the OS level and allocate carefully when the unstable API is exposed. That being said, the combination of MIMD on Intel Xeon Phi and NUMA-aware execution like ours will be a monster devouring your data.
Zen and Zen 2 are both built from one or more CCXs with 4 cores each (some have cores disabled so that imperfect but still usable chips aren't wasted). L3 is shared within a single CCX, but not (or only with significant Infinity Fabric latency) between multiple CCXs. Intel iirc has L2 shared between core pairs, and L3 shared per chip.
While latency is indeed tolerable on single-die Zen 2 systems, unfair cacheline bouncing is still a big issue in cmp_exchange-based algorithms. Using non-NUMA-aware algorithms that rely on cmp_exchange in contended scenarios will cause problems.
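To make that concrete, here is a small sketch of the kind of contended CAS loop meant here (plain std, nothing project-specific). Running it with all threads confined to one CCX (e.g. `taskset -c 0-3`) versus spread across CCXs or sockets makes the cacheline bouncing visible in the timings:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

const THREADS: usize = 8;
const OPS_PER_THREAD: u64 = 1_000_000;

fn main() {
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();

    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..OPS_PER_THREAD {
                    // Classic CAS loop: every retry pulls the cacheline into
                    // this core in exclusive state, so under contention the
                    // line bounces between cores/CCXs.
                    let mut cur = counter.load(Ordering::Relaxed);
                    loop {
                        match counter.compare_exchange_weak(
                            cur,
                            cur + 1,
                            Ordering::AcqRel,
                            Ordering::Relaxed,
                        ) {
                            Ok(_) => break,
                            Err(actual) => cur = actual,
                        }
                    }
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!(
        "{} CAS increments in {:?}",
        counter.load(Ordering::Relaxed),
        start.elapsed()
    );
}
```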
I think keeping tasks on the same core could be extremely important.
I wrote a toy program that connects 500 threads with pipes, and then drops a byte in one end and measures how long it takes to appear at the other end. So it's measuring single-byte reads and writes on a pipe, and context switches. A friend ran it on their 40-core Intel machine, and pinning it to a single core made it run ten times faster:
Linux 4.15.0-91-generic
40-way machine (2 sockets * 10 cores * 2 hyperthreads, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz)
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29 <-- the "1 socket" pinned to below
NUMA node1 CPU(s): 10-19,30-39
thread: 19.09user 151.84system 2:41.81elapsed 105%CPU (0avgtext+0avgdata 44068maxresident)k
pinned to 1 socket, thread: 16.07user 111.84system 1:52.92elapsed 113%CPU (0avgtext+0avgdata 43408maxresident)k
pinned to 1 core, thread: 5.73user 28.37system 0:26.16elapsed 130%CPU (0avgtext+0avgdata 43868maxresident)k
pinned to 1 vCPU, thread: 3.38user 13.09system 0:16.58elapsed 99%CPU (0avgtext+0avgdata 44040maxresident)k
async: 12.06user 10.69system 0:22.78elapsed 99%CPU (0avgtext+0avgdata 43720maxresident)k
pinned async: 11.48user 11.09system 0:22.59elapsed 99%CPU (0avgtext+0avgdata 43484maxresident)k
Granted, the Xeon E5-2680 came out in 2012...
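For reference, a rough sketch of the kind of relay program described above. It uses socket pairs (`std::os::unix::net::UnixStream::pair`) as a stand-in for raw pipes since std has no portable pipe API, and it times a single pass where the real program presumably loops the measurement many times; pinning is then applied externally, e.g. `taskset -c 0 ./relay` for the single-core case:

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;
use std::time::Instant;

const THREADS: usize = 500;

fn main() -> std::io::Result<()> {
    // THREADS + 1 socket pairs form a chain: worker i reads one byte from
    // pair i and forwards it to pair i + 1.
    let mut write_ends = Vec::with_capacity(THREADS + 1);
    let mut read_ends = Vec::with_capacity(THREADS + 1);
    for _ in 0..=THREADS {
        let (w, r) = UnixStream::pair()?;
        write_ends.push(w);
        read_ends.push(r);
    }

    // Main keeps the write end of the first pair and the read end of the last.
    let mut first_tx = write_ends.remove(0);
    let mut last_rx = read_ends.pop().unwrap();

    let handles: Vec<_> = read_ends
        .into_iter()
        .zip(write_ends)
        .map(|(mut rx, mut tx)| {
            thread::spawn(move || {
                let mut buf = [0u8; 1];
                rx.read_exact(&mut buf).unwrap();
                tx.write_all(&buf).unwrap();
            })
        })
        .collect();

    // Time one byte travelling through all the workers: one 1-byte read,
    // one 1-byte write, and at least one context switch per hop.
    let start = Instant::now();
    first_tx.write_all(&[42])?;
    let mut buf = [0u8; 1];
    last_rx.read_exact(&mut buf)?;
    println!("relay through {} threads took {:?}", THREADS, start.elapsed());

    for h in handles {
        h.join().unwrap();
    }
    Ok(())
}
```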
In case it helps, I've gathered in-depth research on NUMA-aware approaches to task scheduling; see https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md (and search for NUMA, as the list is long).
In particular, highlights from my own NUMA issue:
PhD Thesis on NUMA aware scheduling: https://pdfs.semanticscholar.org/a0ab/00a23377f333ca4c34dac2b74abc5af6ca25.pdf
Nabbit-C, extends Nabbit (Cilk/CilkPlus based task dependency) with locality information: https://www.cse.wustl.edu/~kunal/resources/Papers/nabbit-c.pdf
I think keeping tasks on the same core could be extremely important.
You lose load balancing. Tasks can be pinned to a NUMA domain instead (though I don't think pinning is supported on macOS)
You lose load balancing.
Oh, true! Is it better to say, you want as little migration as possible while still keeping cores busy?
Yes, basically the NUMA issue is one of locality: memory should be allocated on the node whose cores will use it, and tasks should be scheduled close to the memory and caches they touch. So to handle all NUMA problems you need to handle both memory (allocation) and scheduling.
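For the "pin to the NUMA domain" option mentioned above, a minimal Linux-only sketch using `libc::sched_setaffinity` directly (the `libc` crate is assumed; crates like hwloc or core_affinity would be more portable). The CPU list here just mirrors the node0 example from the lscpu output earlier and is machine-specific:

```rust
use libc::{cpu_set_t, sched_setaffinity, CPU_SET, CPU_ZERO};
use std::mem;

// Restrict the calling thread to the given set of CPUs.
fn pin_current_thread_to(cpus: &[usize]) -> std::io::Result<()> {
    unsafe {
        let mut set: cpu_set_t = mem::zeroed();
        CPU_ZERO(&mut set);
        for &cpu in cpus {
            CPU_SET(cpu, &mut set);
        }
        // pid 0 means "the calling thread".
        if sched_setaffinity(0, mem::size_of::<cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // CPUs of NUMA node 0 from the example above (0-9,20-29).
    let node0: Vec<usize> = (0..10).chain(20..30).collect();
    pin_current_thread_to(&node0)?;
    println!("pinned to {:?}", node0);
    Ok(())
}
```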
First of all, this isn't a solution or a request for a specific feature; rather, I'd like to kick off a discussion in the hope that it has some fruitful results.
Disclaimer
I did dig into this topic a bit after experiencing very odd performance on a Ryzen 3000 system, and I am somewhat fascinated by how the chiplet architecture affects performance. The observations I made had to do with how cores communicate, but they go hand in hand with how NUMA or multi-CPU systems should behave - though I do not have a multi-CPU system at hand to verify this.
What
Since the task scheduler might spawn and schedule tasks on different threads, it will be affected by NUMA architectures. It would be interesting, and likely beneficial to performance, to explore how the scheduler can take architectures like this into account.
The biggest impact I could see is in moving tasks from one core to another, and in communication between tasks on different cores.
Why
Non-single-die systems are becoming more prominent: Ryzen multi-die CPUs, which to a degree are NUMA systems, have started to appear in commodity hardware. These considerations will probably apply to multi-CPU systems as well.
Aside from memory access, which might not affect tasks as much, the impact of not having a shared cache can be huge. I've seen performance differences in excess of 2x (or half, depending on how you look at it) between setups that share a cache and setups that don't.
For numbers: pinning threads to cores that share a cache moved throughput in one benchmark from 135 MB/s to over 400 MB/s.
Taking cache invalidation into consideration for communication between cores that don't share a cache makes crossbeam channels up to 2x faster in some scenarios (https://github.com/crossbeam-rs/crossbeam/pull/462).
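A rough way to reproduce the shared-cache effect on your own machine: push messages through a bounded crossbeam channel between two threads and compare throughput under `taskset` with two cores that share an L3 versus two that don't. The core numbers, channel capacity, and message count are illustrative, and crossbeam-channel is assumed as a dependency:

```rust
use crossbeam_channel::bounded;
use std::thread;
use std::time::Instant;

const MESSAGES: u64 = 5_000_000;

fn main() {
    let (tx, rx) = bounded::<u64>(1024);

    // Consumer: drain the channel until the sender is dropped.
    let consumer = thread::spawn(move || {
        let mut sum = 0u64;
        while let Ok(v) = rx.recv() {
            sum = sum.wrapping_add(v);
        }
        sum
    });

    let start = Instant::now();
    for i in 0..MESSAGES {
        tx.send(i).unwrap();
    }
    drop(tx); // close the channel so the consumer's recv() loop ends
    let sum = consumer.join().unwrap();
    let elapsed = start.elapsed();

    println!(
        "{} messages in {:?} ({:.1} M msg/s), checksum {}",
        MESSAGES,
        elapsed,
        MESSAGES as f64 / elapsed.as_secs_f64() / 1e6,
        sum
    );
}
```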
Just a few thoughts :)