carstenbauer / ThreadPinning.jl

Readily pin Julia threads to CPU-threads
https://carstenbauer.github.io/ThreadPinning.jl/
MIT License

Interplay with MPI #51

Closed · giordano closed this issue 1 year ago

giordano commented 1 year ago

It'd be useful to at least document how to use ThreadPinning.jl in hybrid MPI+multithreaded applications.

In an offline chat @carstenbauer suggested

> Let's say you have 2 sockets per node and want to use 1 MPI rank per socket with 5 threads each (say, compactly pinned to the first five physical cores of that socket). You should get this with pinthreads(socket(mpi_rank+1, 1:5)), where mpi_rank = MPI.Comm_rank(MPI.COMM_WORLD) or similar.
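
A minimal sketch of that suggestion, assuming exactly one MPI rank per socket so that the 0-based MPI rank maps onto the 1-based socket index:

using MPI, ThreadPinning

MPI.Init()
# 0-based rank of this process; socket() takes 1-based socket indices
mpi_rank = MPI.Comm_rank(MPI.COMM_WORLD)
# pin this rank's threads to the first 5 physical cores of "its" socket
pinthreads(socket(mpi_rank + 1, 1:5))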

giordano commented 1 year ago

For the record, when running a job with one MPI rank per NUMA domain on each node, this should help:

using MPI, ThreadPinning
MPI.Init()
comm = MPI.COMM_WORLD
my_rank = MPI.Comm_rank(comm)
# NUMA domains are 1-based, MPI ranks are 0-based
pinthreads(numa(my_rank + 1, 1:Threads.nthreads()))

sloede commented 1 year ago

> For the record, when running a job with one MPI rank per NUMA domain on each node, this should help:
>
> using MPI, ThreadPinning
> MPI.Init()
> comm = MPI.COMM_WORLD
> my_rank = MPI.Comm_rank(comm)
> pinthreads(numa(my_rank + 1, 1:Threads.nthreads()))

This only works on a single node though, right? If you want to run this on multiple nodes, you first need to get the local rank with something like

using MPI, ThreadPinning
MPI.Init()
comm = MPI.COMM_WORLD
my_rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
nranks_per_node = div(nranks, nnodes) # nnodes needs to be computed a priori from cluster manager knowledge
# assumes ranks are placed block-wise, i.e. ranks 0:(nranks_per_node-1) on the first node, etc.
my_local_rank = my_rank % nranks_per_node
pinthreads(numa(my_local_rank + 1, 1:Threads.nthreads()))

Or am I missing something?
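
One way to obtain the node-local rank without knowing nnodes a priori is the standard MPI idiom of splitting the communicator by shared-memory domain (MPI_Comm_split_type); a sketch using MPI.jl's MPI.Comm_split_type with MPI.COMM_TYPE_SHARED:

using MPI, ThreadPinning
MPI.Init()
comm = MPI.COMM_WORLD
# split COMM_WORLD into one sub-communicator per shared-memory node;
# the rank within that sub-communicator is the node-local rank
node_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
my_local_rank = MPI.Comm_rank(node_comm)
pinthreads(numa(my_local_rank + 1, 1:Threads.nthreads()))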

sloede commented 1 year ago

Here's a small helper function that distributes threads among the cores of a node. The first thread of each rank is distributed with an even stride among the cores of the node, and the rank-local threads follow compactly. Maybe this can be the nucleus of some MPI-aware threadpinning documentation (or even functionality)?

using MPI
using ThreadPinning

function mpi_thread_pinning(nnodes; mpi_comm=MPI.COMM_WORLD,
                                    mpi_rank=MPI.Comm_rank(mpi_comm),
                                    mpi_size=MPI.Comm_size(mpi_comm))
  # Determine number of ranks per node
  @assert mpi_size % nnodes == 0 "Cannot evenly distribute $mpi_size MPI ranks among $nnodes nodes"
  nranks_per_node = div(mpi_size, nnodes)

  # Determine number of threads per node
  nthreads_per_node = nranks_per_node * Threads.nthreads()
  @assert nthreads_per_node <= ncores() "Hyperthreading support not implemented (more threads than cores)"

  # Compute positions of threads (assumes block-wise rank placement per node)
  local_rank = mpi_rank % nranks_per_node
  stride = div(ncores(), nranks_per_node)
  cpu_ids = (1:Threads.nthreads()) .+ local_rank * stride

  # Pin threads
  return pinthreads(node(cpu_ids))
end

Note that the MPI rank/size info is passed as keyword arguments so that you can override them for local testing.
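
For instance, since keyword defaults in Julia are only evaluated when the argument is not supplied, a dry run without an actual MPI launch might look like this (illustrative values, assuming 2 * Threads.nthreads() <= ncores()):

# pretend to be local rank 1 of 2 ranks on a single node;
# MPI.Init() is not required because the defaults are never evaluated
mpi_thread_pinning(1; mpi_rank=1, mpi_size=2)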

carstenbauer commented 1 year ago

You might want to take a look at https://github.com/carstenbauer/ThreadPinning.jl/pull/54, which adds pinthreads(:affinitymask); this respects the external affinity mask that is typically set automatically by SLURM (on all nodes). Within the affinity mask, it pins threads compactly, ignoring hyperthreads by default.
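
From the user side this would be a one-liner, assuming the scheduler (e.g. SLURM) has already set an affinity mask for each rank:

using ThreadPinning
# pin Julia threads compactly within the externally provided
# affinity mask, ignoring hyperthreads by default
pinthreads(:affinitymask)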

In this PR, I also added a draft version (currently single-node) of a pinthreads_mpi function. The idea behind this function is to make it easier to explicitly pin MPI ranks (i.e., their threads) on multiple nodes, without relying on external affinity masks. So, it is (supposed to be) similar to what you're doing here. Note, though, that I would like to avoid MPI.jl as an explicit dependency, if possible (maybe a weak dependency will do).

sloede commented 1 year ago

Ah, I didn't know about #54. That's looking like a great contribution to TP.jl! Please note that some systems do not use SLURM but PBS & friends, and thus users have to configure the rank/thread placement manually using, e.g., omplace or some other mechanism.

Therefore, it would be great if there were also support for manually specifying the thread placement. At least the typical cases, like evenly spreading out MPI ranks on a node and placing the threads on consecutive cores (with or without hyperthreading enabled), could easily be supported without relying on MPI.jl, by just accepting a node-local rank and the total number of ranks on this node as arguments (see the sketch below). I don't think it would be necessary to support every exotic configuration from the outset, such as an odd number of ranks per node or a varying number of threads per rank; these can be added later on an as-needed basis.
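
To illustrate, a minimal sketch of such an MPI.jl-free interface; the function name and signature here are hypothetical, not actual ThreadPinning.jl API:

using ThreadPinning

# Hypothetical helper, not part of ThreadPinning.jl: pin this rank's
# threads given only node-local information.
function pin_rank_threads(local_rank::Integer, nranks_per_node::Integer)
    # spread ranks evenly across the node's cores, threads placed compactly
    stride = div(ncores(), nranks_per_node)
    pinthreads(node((1:Threads.nthreads()) .+ local_rank * stride))
end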

carstenbauer commented 1 year ago

FYI, I just merged #54. However, I did not extend pinthreads_mpi for manual pinning to multi-node cases (which @sloede is interested in). Frankly, I also have little motivation and incentive to work on this because I almost exclusively use MPI on systems that use SLURM. Not saying it's not going to happen, just that it's low priority for me.

@sloede, perhaps you'd be willing to work on improving pinthreads_mpi yourself? -> #61

In any case, since @giordano is probably already satisfied with the new pinthreads(:affinitymask) (correct me if I'm wrong on this) I'm going to close this issue and open a new one specifically for pinthreads_mpi (#61).