carstenbauer / ThreadPinning.jl

Readily pin Julia threads to CPU-threads
https://carstenbauer.github.io/ThreadPinning.jl/
MIT License

MPI Support (new attempt) #99

Closed carstenbauer closed 1 month ago

carstenbauer commented 1 month ago

TODO:

Changelog:

(cc @jagot)

Closes #61

carstenbauer commented 1 month ago

Basic multi-node test (of mpi_pinthreads(:numa)) was successful. (See https://carstenbauer.github.io/ThreadPinning.jl/previews/PR99/examples/ex_mpi/)

NUMA node 1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143]
NUMA node 2: [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159]
NUMA node 3: [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175]
NUMA node 4: [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191]
NUMA node 5: [64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207]
NUMA node 6: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223]
NUMA node 7: [96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239]
NUMA node 8: [112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255]

BEFORE: Where are the Julia threads of the MPI ranks running?
    rank 0 is running 2 Julia threads on the CPU-threads [127, 253] of node nid004406
    rank 1 is running 2 Julia threads on the CPU-threads [113, 182] of node nid004406
    rank 2 is running 2 Julia threads on the CPU-threads [109, 138] of node nid004406
    rank 3 is running 2 Julia threads on the CPU-threads [26, 115] of node nid004406
    rank 4 is running 2 Julia threads on the CPU-threads [4, 146] of node nid005218
    rank 5 is running 2 Julia threads on the CPU-threads [101, 236] of node nid005218
    rank 6 is running 2 Julia threads on the CPU-threads [42, 255] of node nid005218
    rank 7 is running 2 Julia threads on the CPU-threads [8, 90] of node nid005218
    rank 8 is running 2 Julia threads on the CPU-threads [80, 237] of node nid005908
    rank 9 is running 2 Julia threads on the CPU-threads [23, 198] of node nid005908
    rank 10 is running 2 Julia threads on the CPU-threads [5, 47] of node nid005908
    rank 11 is running 2 Julia threads on the CPU-threads [54, 26] of node nid005908
    rank 12 is running 2 Julia threads on the CPU-threads [42, 143] of node nid005915
    rank 13 is running 2 Julia threads on the CPU-threads [8, 120] of node nid005915
    rank 14 is running 2 Julia threads on the CPU-threads [238, 217] of node nid005915
    rank 15 is running 2 Julia threads on the CPU-threads [27, 159] of node nid005915

AFTER: Where are the Julia threads of the MPI ranks running?
    rank 0 is running 2 Julia threads on the CPU-threads [0, 1] of node nid004406
    rank 1 is running 2 Julia threads on the CPU-threads [16, 17] of node nid004406
    rank 2 is running 2 Julia threads on the CPU-threads [32, 33] of node nid004406
    rank 3 is running 2 Julia threads on the CPU-threads [48, 49] of node nid004406
    rank 4 is running 2 Julia threads on the CPU-threads [0, 1] of node nid005218
    rank 5 is running 2 Julia threads on the CPU-threads [16, 17] of node nid005218
    rank 6 is running 2 Julia threads on the CPU-threads [32, 33] of node nid005218
    rank 7 is running 2 Julia threads on the CPU-threads [48, 49] of node nid005218
    rank 8 is running 2 Julia threads on the CPU-threads [0, 1] of node nid005908
    rank 9 is running 2 Julia threads on the CPU-threads [16, 17] of node nid005908
    rank 10 is running 2 Julia threads on the CPU-threads [32, 33] of node nid005908
    rank 11 is running 2 Julia threads on the CPU-threads [48, 49] of node nid005908
    rank 12 is running 2 Julia threads on the CPU-threads [0, 1] of node nid005915
    rank 13 is running 2 Julia threads on the CPU-threads [16, 17] of node nid005915
    rank 14 is running 2 Julia threads on the CPU-threads [32, 33] of node nid005915
    rank 15 is running 2 Julia threads on the CPU-threads [48, 49] of node nid005915
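
For reference, a minimal sketch of what a driver script for such a test can look like. Only mpi_pinthreads and getcpuids are assumed from the package; the report_cpuids helper below is purely illustrative and not part of ThreadPinning.jl.

using MPI
using ThreadPinning

MPI.Init()

function report_cpuids(comm = MPI.COMM_WORLD)
    # Each rank reports which CPU-threads its Julia threads currently run on;
    # rank 0 collects and prints the result.
    info = (MPI.Comm_rank(comm), gethostname(), getcpuids())
    all_info = MPI.gather(info, comm; root = 0)
    if MPI.Comm_rank(comm) == 0
        for (rank, host, cpuids) in all_info
            println("rank $rank is running $(length(cpuids)) Julia threads ",
                    "on the CPU-threads $cpuids of node $host")
        end
    end
end

report_cpuids()        # BEFORE
mpi_pinthreads(:numa)  # distribute MPI ranks round-robin among NUMA domains
report_cpuids()        # AFTER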
carstenbauer commented 1 month ago

cc @sloede (because you were interested in this at some point)

carstenbauer commented 1 month ago

I plan to merge this tomorrow. @jagot are you happy with this PR or is something missing?

jagot commented 1 month ago

Sorry for the late reply; my vacation started this week, and I obviously decided to tear down my kitchen. I will have a look tomorrow!

jagot commented 1 month ago

I think the API looks nice, but it does not work as I would naïvely expect/hope. I ran the example from the documentation, adding a call to threadinfo() to actually see the thread distribution after pinning (I modified the output for the colourless case, see https://github.com/jagot/ThreadPinning.jl/tree/improve-colorless-output):

Job official-example ID 183709.bossy on 2 nodes, 2 processes, and 48 threads per process
Host: janeway16
  Activating project at `~/projects/mpi-test`
  Activating project at `~/projects/mpi-test`
NUMA node 1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71]
NUMA node 2: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]

BEFORE: Where are the Julia threads of the MPI ranks running?
    rank 0 is running 48 Julia threads on the CPU-threads [95, 37, 41, 43, 72, 35, 46, 78, 38, 39, 26, 28, 44, 82, 93, 30, 29, 81, 88, 90, 75, 47, 31, 84, 76, 25, 34, 40, 89, 45, 77, 32, 85, 92, 45, 24, 36, 79, 86, 33, 83, 87, 73, 85, 75, 89, 80, 0] of node janeway16
    rank 1 is running 48 Julia threads on the CPU-threads [48, 75, 2, 83, 88, 92, 46, 38, 73, 90, 28, 78, 33, 84, 74, 41, 91, 45, 77, 86, 25, 1, 82, 42, 36, 26, 87, 35, 29, 79, 43, 47, 34, 85, 76, 44, 31, 75, 73, 32, 0, 30, 39, 44, 93, 80, 94, 72] of node janeway17

AFTER: Where are the Julia threads of the MPI ranks running?
    rank 0 is running 48 Julia threads on the CPU-threads [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71] of node janeway16
Hostname:   janeway16
CPU(s):     2 x AMD EPYC 7352 24-Core Processor
CPU target:     znver2
Cores:      48 (96 CPU-threads due to 2-way SMT)
NUMA domains:   2 (24 cores each)

Julia threads:  48

CPU socket 1
  0,48h, 1,49h, 2,50h, 3,51h, 4,52h, 5,53h, 6,54h, 7,55h,
  8,56h, 9,57h, 10,58h, 11,59h, 12,60h, 13,61h, 14,62h, 15,63h,
  16,64h, 17,65h, 18,66h, 19,67h, 20,68h, 21,69h, 22,70h, 23,71h

CPU socket 2
  _,_, _,_, _,_, _,_, _,_, _,_, _,_, _,_,
  _,_, _,_, _,_, _,_, _,_, _,_, _,_, _,_,
  _,_, _,_, _,_, _,_, _,_, _,_, _,_, _,_

# = Julia thread, h = Julia thread on HT, ! = >1 Julia thread

(Mapping: 1 => 0, 2 => 1, 3 => 2, 4 => 3, 5 => 4, ...)
    rank 1 is running 48 Julia threads on the CPU-threads [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71] of node janeway17
Hostname:   janeway16
CPU(s):     2 x AMD EPYC 7352 24-Core Processor
CPU target:     znver2
Cores:      48 (96 CPU-threads due to 2-way SMT)
NUMA domains:   2 (24 cores each)

Julia threads:  48

CPU socket 1
  0,48h, 1,49h, 2,50h, 3,51h, 4,52h, 5,53h, 6,54h, 7,55h,
  8,56h, 9,57h, 10,58h, 11,59h, 12,60h, 13,61h, 14,62h, 15,63h,
  16,64h, 17,65h, 18,66h, 19,67h, 20,68h, 21,69h, 22,70h, 23,71h

CPU socket 2
  _,_, _,_, _,_, _,_, _,_, _,_, _,_, _,_,
  _,_, _,_, _,_, _,_, _,_, _,_, _,_, _,_,
  _,_, _,_, _,_, _,_, _,_, _,_, _,_, _,_

# = Julia thread, h = Julia thread on HT, ! = >1 Julia thread

(Mapping: 1 => 0, 2 => 1, 3 => 2, 4 => 3, 5 => 4, ...)

i.e., there are two MPI ranks, each running on a separate node. The problem is that all threads are pinned to the first socket only (on these machines, sockets and NUMA domains coincide), leaving the other socket unused, and that half of the threads are pinned to hyperthreads.

How do I achieve pinning across all sockets? Should there be a :node alternative?

carstenbauer commented 1 month ago

You're running on two nodes, each hosting one MPI rank with 48 Julia threads. What mpi_pinthreads(:numa) (or mpi_pinthreads(:sockets), since NUMA domains and sockets coincide here) does is distribute MPI ranks (that is, all of their Julia threads) in a round-robin fashion among the NUMA domains of each node. Since there is only a single MPI rank per node, both MPI ranks (that is, their 48 Julia threads) simply get assigned to the first NUMA domain of their respective nodes.

Why are threads pinned to hyperthreads? Because 48 threads only fit into the NUMA domain if hyperthreads are included. If you don't want hyperthreads involved, choose 24 threads per MPI rank instead of 48.

How can I also occupy the second NUMA domain on each node? The simple answer is: use 2 MPI ranks per node. More generally, choose as many MPI ranks per node as there are NUMA domains. This is what one is typically advised to do anyway for MPI applications, irrespective of the programming language.

What if I don't want to use 2 MPI ranks per node? Well, you probably should 😉. More seriously, if you only have a single MPI rank per node, you don't need any of the mpi_pinthreads business anyway (because there is no conflict between ranks). Just call pinthreads(:numa) on each MPI rank and you're done.

(We could imagine a more complicated case, though. Say we have 2 MPI ranks per node but a system with 4 NUMA domains per socket. We might want to distribute MPI ranks among sockets but, within each socket, distribute the Julia threads of each MPI rank among NUMA domains. This is currently not available out of the box, but one could imagine an mpi_pinthreads(:sockets, :numa) variant.)
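
Until something like that exists, here is a rough, untested sketch of how one could approximate such a hybrid scheme by hand. It assumes ThreadPinning's nsockets(), nnuma() and numa(i) helpers and uses MPI's shared-memory communicator split to obtain the node-local rank; treat it as an illustration, not as supported API.

using MPI
using ThreadPinning

MPI.Init()
comm = MPI.COMM_WORLD

# Node-local (0-based) rank via the standard shared-memory communicator split.
localcomm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
localrank = MPI.Comm_rank(localcomm)

mysocket = localrank % nsockets()           # distribute ranks among sockets
numa_per_socket = nnuma() ÷ nsockets()      # e.g. 4 NUMA domains per socket
firstnuma = mysocket * numa_per_socket + 1  # numa(i) is 1-based

# Spread this rank's Julia threads round-robin over the NUMA domains of "its"
# socket (assumes each NUMA domain has enough CPU-threads for its share).
nt = Threads.nthreads()
cpuids = [numa(firstnuma + (t - 1) % numa_per_socket)[(t - 1) ÷ numa_per_socket + 1]
          for t in 1:nt]
pinthreads(cpuids)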

carstenbauer commented 1 month ago

BTW, threadinfo() adds little gaps (i.e. spaces) between CPU IDs that don't belong to the same core (look closely and compare to threadinfo(; coregaps=false)). This should also be the case for threadinfo(; color=false) and allows you to identify hyperthreads without adding "h"s.
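
For example (keyword names as mentioned above):

using ThreadPinning

threadinfo()                    # default: small gaps between CPU IDs of different cores
threadinfo(; coregaps = false)  # same, but without the gaps (for comparison)
threadinfo(; color = false)     # monochrome output; hyperthreads should still be identifiable via the gaps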

jagot commented 1 month ago

Thanks for the detailed writeup! I am relearning MPI, having not used it for a long while.

I still think it is a valid use case to have only one MPI rank per node, if nothing else to be able to compare. AFAIU, architectures vary, and on some of them it is (or should be) beneficial to utilize all cores within the same SMP program.

pinthreads indeed does the trick, but for it to be applicable, the code needs to be aware that it is alone on a particular node, which is what I implemented in https://github.com/carstenbauer/ThreadPinning.jl/issues/61#issuecomment-2267110936. That logic could easily be baked into mpi_getlocalrank (https://github.com/carstenbauer/ThreadPinning.jl/blob/345cd1be8760ea218cb056a03b48664620e3e912/ext/MPIExt/mpi_querying.jl#L55-L65) if we so wish, or it could be left up to the user to do something like:

using ThreadPinning
using MPI

MPI.Initialized() || MPI.Init()

# Returns true if this MPI rank is the only rank on its node.
function mpi_alone_on_this_node(; comm = MPI.COMM_WORLD, dest = 0)
    rank = MPI.Comm_rank(comm)
    hostname = gethostname()
    all_hostnames = MPI.gather(hostname, comm; root = dest)

    if rank == dest
        # For each hostname, record whether it occurs exactly once.
        hostname_unique = Dict(h => count(==(h), all_hostnames) == 1
            for h in unique(all_hostnames))
        # Tell every other rank whether it is alone on its node.
        for (i, h) in enumerate(all_hostnames)
            i - 1 == dest && continue
            MPI.send(hostname_unique[h], comm; dest = i - 1)
        end
        hostname_unique[hostname]
    else
        MPI.recv(comm; source = dest)
    end
end

if mpi_alone_on_this_node()
    pinthreads(:numa)
else
    mpi_pinthreads(:numa)
end

I would prefer the former, I think.
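
(For reference, the same check can also be sketched with MPI's shared-memory communicator split, which avoids the manual gather/send round-trip; untested:)

using MPI

# A rank is alone on its node iff the node-local (shared-memory) communicator
# it belongs to contains exactly one rank.
function mpi_alone_on_this_node_alt(; comm = MPI.COMM_WORLD)
    localcomm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
    MPI.Comm_size(localcomm) == 1
end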

carstenbauer commented 1 month ago

"I would prefer the former, I think."

I think I don't. Not because 1 MPI rank per node is unreasonable (it isn't) but because it would make the mpi_pinthreads API confusing. What you're proposing is to do one thing if you're alone on a node and a different thing if you're not.

Currently, the semantics of mpi_pinthreads is simple: the :numa in mpi_pinthreads(:numa) means that MPI ranks (with all their threads as "one unit") will be distributed among NUMA domains. It does not mean that the threads of an MPI rank get distributed among different NUMA domains. But that's what you want it to do in the case where the MPI rank is alone on the node. While I believe I see where you're coming from, this seems pretty arbitrary and confusing to me.

jagot commented 1 month ago

Fair point. I will play around, having the above function in my user code for the time being. When/if something solidifies, and it is useful to anyone but me, we can revisit this issue.