carstenbauer closed this issue 1 month ago
Currently we ask the user to provide the world size and rank information via function arguments to `pinthreads_mpi`. The idea was to avoid MPI.jl as a direct dependency. We should try to use the new (Julia 1.9) weak dependencies / extensions feature to use MPI.jl in TP.jl for this part without having to make MPI.jl a direct dependency.
As came up over at Discourse, a feature like this, i.e. manually pinning the threads of separate processes, can also be very useful when using the Distributed stdlib instead of MPI.
While playing around with this today, I found that figuring out the local rank as suggested in #51, i.e. taking the global rank modulo the number of processes per node (which the user can provide as input), does not always work. I ran into a situation where I was running two processes per node and MPI decided to place all even ranks on one node and all odd ranks on the other. The modulus then yields the same local rank for multiple processes on the same node.
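To make the failure mode concrete, here is a minimal standalone sketch (plain Julia, no MPI; the 4-rank round-robin layout is a hypothetical example, not taken from the actual run):

```julia
# Hypothetical layout: 4 ranks, 2 processes per node, round-robin placement,
# i.e. MPI put the even ranks on node "a" and the odd ranks on node "b".
ppn = 2
ranks = 0:3
node = [iseven(r) ? "a" : "b" for r in ranks]          # ["a", "b", "a", "b"]

# The #51 heuristic: local rank = global rank mod processes per node.
local_guess = [r % ppn for r in ranks]                  # [0, 1, 0, 1]

# Ranks 0 and 2 both sit on node "a" yet both get "local rank" 0 -- a collision.
```

With block placement (ranks 0,1 on one node, 2,3 on the other) the heuristic would happen to work, which is why it can pass in testing and still fail in production.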
Instead, I came up with the following function, which sends the hostname of each process to rank zero, who replies with unique local ranks to all processes (including itself, of course):
```julia
using MPI
using Sockets: gethostname

function get_local_rank(comm = MPI.COMM_WORLD)
    my_rank = MPI.Comm_rank(comm)
    if my_rank == 0
        num_ranks = MPI.Comm_size(comm)
        # Gather the hostname of every rank (index i belongs to rank i-1).
        hostnames = Vector{String}(undef, num_ranks)
        hostnames[1] = gethostname()
        my_local_rank = -1
        for i = 2:num_ranks
            @info "Waiting on hostname from rank $(i-1)"
            hostnames[i] = MPI.recv(comm; source = i - 1)
        end
        @info "All hostnames" hostnames
        uhostnames = unique(hostnames)
        # `alone` is true iff every process sits on its own node.
        alone = length(uhostnames) == num_ranks
        # Assign consecutive local ranks within each node and send them out.
        for n in uhostnames
            for (i, j) in enumerate(findall(==(n), hostnames))
                if j == 1
                    my_local_rank = i - 1
                    continue
                end
                MPI.send((i - 1, alone), comm; dest = j - 1)
            end
        end
        my_local_rank, alone
    else
        # Send our hostname to rank 0 ...
        MPI.send(gethostname(), comm; dest = 0)
        # ... and wait for it to reply with our local rank and whether we
        # are alone on this node.
        MPI.recv(comm; source = 0)
    end
end
```
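The bookkeeping that rank 0 performs can be exercised without an MPI runtime. The following is a standalone sketch of that logic (the `local_ranks` helper is mine, not part of the code above): group ranks by hostname and number them consecutively within each node.

```julia
# Given the gathered hostnames (index i holds the hostname of rank i-1),
# assign consecutive local ranks within each node -- the same loop structure
# rank 0 uses above, minus the MPI traffic.
function local_ranks(hostnames::Vector{String})
    locals = similar(hostnames, Int)
    for n in unique(hostnames)
        for (i, j) in enumerate(findall(==(n), hostnames))
            locals[j] = i - 1   # i-th process on node n gets local rank i-1
        end
    end
    locals
end

# The problematic round-robin layout: even ranks on "a", odd ranks on "b".
local_ranks(["a", "b", "a", "b"])  # -> [0, 0, 1, 1], unique per node
```

Unlike the modulus heuristic, this assignment is correct for any placement, because it is derived from where the processes actually ended up.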
Thanks! Once I'm done with the rewrite (cb/revamp) we should restart the effort to add MPI (and Distributed) support via extensions (old effort: https://github.com/carstenbauer/ThreadPinning.jl/pull/64).
@jagot The revamp has landed on the main branch. Please try it out. I might have a little bit of bandwidth left to try to work on the MPI integration. We'll see.
If there is no external affinity mask (e.g. set by SLURM) that one can utilize (with `pinthreads(:affinitymask)`) to pin Julia threads in (hybrid) MPI applications, we provide `pinthreads_mpi` to "manually" achieve a desired pinning pattern. However, `pinthreads_mpi` is currently very bare-bones and doesn't support multinode scenarios yet. Would be great to have it improved. This is low-priority for me, because I don't really need it.
(cc @sloede)