Closed: @giordano closed this issue 1 year ago
For the record, when running a job with one MPI rank per NUMA domain in each node, this should help:
using MPI, ThreadPinning
MPI.Init()
comm = MPI.COMM_WORLD
my_rank = MPI.Comm_rank(comm)
# Pin this rank's threads to the cores of "its" NUMA domain (1-based domain index)
pinthreads(numa(my_rank + 1, 1:Threads.nthreads()))
This only works on a single node though, right? If you want to run this on multiple nodes, you first need to get the local rank with something like
using MPI, ThreadPinning
MPI.Init()
comm = MPI.COMM_WORLD
my_rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
nranks_per_node = div(nranks, nnodes) # nnodes needs to be computed a priori from cluster manager knowledge
my_local_rank = my_rank % nranks_per_node
pinthreads(numa(my_local_rank + 1, 1:Threads.nthreads()))
Or am I missing something?
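As an aside, if MPI.jl is available anyway, the node-local rank can also be obtained without knowing nnodes up front by splitting COMM_WORLD into per-node (shared-memory) communicators. This is only a sketch, assuming MPI.jl's Comm_split_type and COMM_TYPE_SHARED bindings and, as above, one rank per NUMA domain:

using MPI, ThreadPinning
MPI.Init()
comm = MPI.COMM_WORLD
my_rank = MPI.Comm_rank(comm)
# All ranks that can share memory (i.e. that live on the same node) end up in one communicator
node_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, my_rank)
my_local_rank = MPI.Comm_rank(node_comm)  # 0-based rank within this node
pinthreads(numa(my_local_rank + 1, 1:Threads.nthreads()))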
Here's a small helper function that distributes threads among the cores of a node. The first thread of each rank is placed with an even stride across the cores of the node, and the rank-local threads follow compactly. Maybe this can be the nucleus of some MPI-aware thread pinning documentation (or even functionality)?
using MPI
using ThreadPinning

function mpi_thread_pinning(nnodes; mpi_comm=MPI.COMM_WORLD,
                            mpi_rank=MPI.Comm_rank(mpi_comm),
                            mpi_size=MPI.Comm_size(mpi_comm))
    # Determine number of ranks per node
    @assert mpi_size % nnodes == 0 "Cannot evenly distribute $mpi_size MPI ranks among $nnodes nodes"
    nranks_per_node = div(mpi_size, nnodes)

    # Determine number of threads per node
    nthreads_per_node = nranks_per_node * Threads.nthreads()
    @assert nthreads_per_node <= ncores() "Hyperthreading support not implemented (more threads than cores)"

    # Compute positions of threads
    local_rank = mpi_rank % nranks_per_node
    stride = div(ncores(), nranks_per_node)
    range = collect(1:Threads.nthreads()) .+ (local_rank * stride)

    # Pin threads
    return pinthreads(node(range))
end
Note that the MPI rank/size info is given as kwargs so that you can override them for local testing.
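For illustration, a call could look like this (the node count and the overridden rank/size values are made up; in a real job only the node count is needed):

# In an actual MPI job that was launched on, say, 4 nodes:
using MPI, ThreadPinning
MPI.Init()
mpi_thread_pinning(4)

# Local testing without a multi-rank launch, by overriding the kwargs:
# pretend to be rank 3 of 8 ranks spread over 4 nodes (2 ranks per node).
mpi_thread_pinning(4; mpi_rank=3, mpi_size=8)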
You might want to take a look at https://github.com/carstenbauer/ThreadPinning.jl/pull/54, which adds pinthreads(:affinitymask). This respects the external affinity mask that is typically set automatically by SLURM (on all nodes). Within the affinity mask it pins compactly, ignoring hyperthreads by default.
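For reference, usage then boils down to something like the following (the srun flags in the comment are just one example of how SLURM might set the per-rank mask):

# Launched e.g. via: srun --ntasks-per-node=4 --cpus-per-task=16 julia -t 16 script.jl
using ThreadPinning
pinthreads(:affinitymask)  # pin threads compactly within the mask SLURM set for this rank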
In this PR, I also added a draft version (currently single-node) of a pinthreads_mpi function. The idea behind this function is to make it easier to explicitly pin MPI ranks (i.e. their threads) on multiple nodes, without relying on external affinity masks. So it is (supposed to be) similar to what you're doing here. Note, though, that I would like to avoid MPI.jl as an explicit dependency, if possible (maybe a weak dependency will do).
Ah, I didn't know about #54. That's looking like a great contribution to TP.jl! Please note that some systems do not use SLURM but PBS & friends, and thus rely on the user manually configuring the rank/thread placement using, e.g., omplace or some other mechanism.
Therefore, it would be great if there were also support for manually specifying the thread placement. At least the typical cases, like evenly spreading out MPI ranks over a node and placing the threads on consecutive cores (with or without hyperthreading enabled), could easily be supported without having to rely on MPI.jl, by just accepting a node-local rank and the total number of ranks on that node as arguments. I don't think it would be necessary to support every corner case from the outset, such as an odd number of ranks per node or a varying number of threads per rank; these can be added later on an as-needed basis.
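To make that concrete, such a manual variant could look roughly like the helper above, just without MPI.jl in the picture. The name pinthreads_manual and its signature are purely hypothetical, a sketch rather than a proposed API:

using ThreadPinning

# Hypothetical sketch: pin this rank's threads given only node-local information.
# `local_rank` is the 0-based rank index on this node; `nranks_per_node` is the total
# number of ranks placed on this node. Both would come from the user or the launcher
# (e.g. omplace), not from MPI.jl.
function pinthreads_manual(local_rank::Integer, nranks_per_node::Integer)
    nthreads_per_node = nranks_per_node * Threads.nthreads()
    @assert nthreads_per_node <= ncores() "More threads than cores; hyperthreading not handled here"
    stride = div(ncores(), nranks_per_node)
    cores = collect(1:Threads.nthreads()) .+ (local_rank * stride)
    return pinthreads(node(cores))
end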
FYI, I just merged #54. However, I did not extend pinthreads_mpi for manual pinning to multi-node cases (which @sloede is interested in). Frankly, I also have little motivation and incentive to work on this because I almost exclusively use MPI on systems that use SLURM. Not saying it's not going to happen, just that it's low priority for me.
@sloede, perhaps you'd be willing to work on improving pinthreads_mpi yourself? -> #61
In any case, since @giordano is probably already satisfied with the new pinthreads(:affinitymask) (correct me if I'm wrong on this), I'm going to close this issue and open a new one specifically for pinthreads_mpi (#61).
It'd be useful to at least document how to use ThreadPinning.jl in hybrid MPI+multithreaded applications. In an offline chat @carstenbauer suggested