carstenbauer / ThreadPinning.jl

Readily pin Julia threads to CPU-threads
https://carstenbauer.github.io/ThreadPinning.jl/
MIT License

Allow grouping processors by memory domains #36

Closed · giordano closed this issue 1 year ago

giordano commented 1 year ago

In the same vein as #35, grouping processors by memory domains in scaling benchmarks would allow an easier comparison with https://github.com/RRZE-HPC/TheBandwidthBenchmark. Example: Fujitsu FX700 S1 M4 C48.

carstenbauer commented 1 year ago

I might have to take a closer look, but that's already possible. `pinthreads(:compact; places=:numa)` pins Julia threads to memory domains: first, the first memory domain is filled, then the next, and so on. I need to check whether it is compatible with `hyperthreads=false`, which they use in their benchmark. Also, on A64FX one might need to be careful to exclude the "assistant cores" (or whatever they are called). I need to check how they exclude them with likwid-pin.
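
For concreteness, this is roughly the usage I mean (a minimal sketch based on the API as discussed in this thread; keyword arguments like `hyperthreads` and the exact signatures may differ across ThreadPinning.jl versions):

```julia
using ThreadPinning

# Fill memory domains one after another: first all CPU-threads of the
# first NUMA domain, then the second, and so on.
pinthreads(:compact; places=:numa)

# Inspect the resulting Julia thread -> CPU-thread mapping,
# grouped by NUMA domain.
threadinfo(; groupby=:numa)
```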

carstenbauer commented 1 year ago

What would be nice to have is support for all likwid-pin compatible pinning specifications such that one could just pass the likwid-pin string into pinthreads. But that's obviously much more ambitious.

carstenbauer commented 1 year ago

Note to self: `E:M0:$nt:1:2` (block size = 1, stride = 2) in LIKWID means the following: within the first memory/NUMA domain, distribute `$nt` threads by placing one (== block size) software thread on every other (== stride 2) hardware thread (i.e., including hyperthreads).
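
To illustrate the semantics, here is a small hypothetical helper (not part of ThreadPinning.jl or LIKWID) that computes which CPU IDs such an expression selects, given the hardware-thread IDs of one memory domain:

```julia
# Mirror the E:M0:nt:blocksize:stride semantics described above: walk the
# hardware threads of one memory domain and place `blocksize` software
# threads every `stride` hardware threads. `domain_cpuids` is assumed to
# hold the domain's hardware-thread IDs in LIKWID's ordering.
function expression_cpuids(domain_cpuids::AbstractVector{<:Integer},
                           nt::Integer; blocksize = 1, stride = 2)
    ids = Int[]
    i = firstindex(domain_cpuids)
    while length(ids) < nt && i <= lastindex(domain_cpuids)
        stop = min(i + blocksize - 1, lastindex(domain_cpuids))
        append!(ids, domain_cpuids[i:stop])
        i += stride
    end
    return ids[1:min(nt, length(ids))]
end

# E:M0:4:1:2 on a domain with hardware threads 0..23 -> [0, 2, 4, 6]
expression_cpuids(0:23, 4)
```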

carstenbauer commented 1 year ago

Ok, so in principle you can just use `pinthreads(:compact)` and would get the same pinning that they use (if you vary the number of threads between 1 and 12, which is the size of a memory domain). However, the real "difficulty" on Fugaku is the existence of the "extra cores". According to the lscpu output that you sent me some time ago, these are cores 0 and 1, which we should ignore. Unfortunately, this means we can't use the built-in strategies anymore and need to specify the CPU IDs manually (I don't know if/how LIKWID handles this), i.e., `pinthreads(12:23)` if we have 12 threads. This should give the following on Fugaku:

```
julia> threadinfo(; groupby=:numa, color=false)

| _ |
| _ |
| 12,13,14,15,16,17,18,19,20,21,22,23 |
| _,_,_,_,_,_,_,_,_,_,_,_ |
| _,_,_,_,_,_,_,_,_,_,_,_ |
| _,_,_,_,_,_,_,_,_,_,_,_ |

# = Julia thread, | = NUMA separator

Julia threads: 12
├ Occupied CPU-threads: 12
└ Mapping (Thread => CPUID): 1 => 12, 2 => 13, 3 => 14, 4 => 15, 5 => 16, ...
```
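
Generalizing this manual pinning, one could compute the compute-core IDs per memory domain and skip the assistant cores explicitly. A sketch, assuming (per this thread) assistant cores 0 and 1 and four 12-core compute domains starting at CPU ID 12; not verified on A64FX:

```julia
using ThreadPinning

# Fugaku per the discussion above: CPU IDs 0 and 1 are assistant cores;
# the 48 compute cores are 12:59, in four NUMA domains of 12 each.
compute_domains = [12 .+ 12 * (d - 1) .+ (0:11) for d in 1:4]

# Compact pinning over compute cores only, i.e., fill domain after
# domain while never touching the assistant cores.
nt = Threads.nthreads()  # e.g. 12 -> equivalent to pinthreads(12:23)
cpuids = collect(Iterators.flatten(compute_domains))[1:nt]
pinthreads(cpuids)
```
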
giordano commented 1 year ago

To be clear, I think I opened this issue in the wrong repository as well. https://github.com/giordano/julia-on-fugaku/blob/cd5795aa746ec83286dc5d82aefdde50c56f74a3/benchmarks/bandwidthbenchmarkjl/bwscaling.pdf was already obtained with `pinthreads(:compact; places=:numa)` (see https://github.com/giordano/julia-on-fugaku/blob/cd5795aa746ec83286dc5d82aefdde50c56f74a3/benchmarks/bandwidthbenchmarkjl/bench.jl#L6). What I was asking for was to get the output of bwscaling grouped by memory regions, like https://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/Fujitsu-FX700-S1-M4-C48. You can kind of see the pattern of the four regions once you get to more than 12 cores, but the plots aren't immediately comparable.

carstenbauer commented 1 year ago

Ah, I think I know what you mean. Yes, that discussion should go into the BandwidthBenchmark.jl repo. In any case, that grouping is not really a "minor extension" of bwscaling, because the latter currently doesn't know anything about the pinning of the Julia threads or about memory regions: it doesn't know that threads are spread across memory domains, nor how many memory domains there are or how large they are. Hence, it can't know how to group the linear data.

Maybe I can add what you want as a separate function (a version of bwscaling that takes care of the pinning etc.). However, given that you've pinned the threads correctly, it's just a post-processing step, i.e., a reshaping of the data that bwscaling already outputs. So it should only take you a few minutes, I guess 😉
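
Roughly something like this (a hypothetical post-processing sketch; it assumes compact pinning across four 12-core domains and that the scaling run yields a plain vector of bandwidths, which may not match bwscaling's actual output format):

```julia
# bw[n]: measured bandwidth with n Julia threads (placeholder data here),
# the threads having been pinned compactly across NUMA domains of 12.
bw = rand(48)

domainsize, ndomains = 12, 4

# Column d holds the 12 measurements taken while the d-th memory domain
# is being filled, making the per-domain groups directly comparable to
# the TheBandwidthBenchmark plots.
bw_by_domain = reshape(bw, domainsize, ndomains)
```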

In any case, in the context of this repo, the only interesting thing is how to handle "extra cores" that should be excluded. But I'll track that in a new issue (#38), since it is also a much more general problem (Apple's M1 has different categories of cores, and so do some modern Intel CPUs).

(BTW, it seems that you haven't excluded the extra cores in giordano/julia-on-fugaku@cd5795a/benchmarks/bandwidthbenchmarkjl/bwscaling.pdf, or have you? I would have naively expected that to be visible in the results, but what do I know 🙂)

giordano commented 1 year ago

> (BTW, it seems that you haven't excluded the extra cores in giordano/julia-on-fugaku@cd5795a/benchmarks/bandwidthbenchmarkjl/bwscaling.pdf, or have you? I would have naively expected that to be visible in the results, but what do I know 🙂)

Ookami nodes have 48 cores; it's only Fugaku that has the 2 extra cores reserved for the operating system. Or rather, only the 2.2 GHz chips do, according to the datasheet (Fugaku is the only system I know of with the higher-powered CPUs; the other systems I have access to use the 1.8 GHz chips).

carstenbauer commented 1 year ago

Ah, thanks for the info. (It would be great to get access to Ookami so that I can inspect things myself and run a few tests on it.)