carstenbauer / ThreadPinning.jl

Readily pin Julia threads to CPU-threads
https://carstenbauer.github.io/ThreadPinning.jl/
MIT License

MPI: Improve "manual" pinning (`pinthreads_mpi`) #61

Closed · carstenbauer closed this 1 month ago

carstenbauer commented 1 year ago

If there is no external affinity mask (e.g. set by SLURM) that one can utilize (via `pinthreads(:affinitymask)`) to pin Julia threads in (hybrid) MPI applications, we provide `pinthreads_mpi` to "manually" achieve a desired pinning pattern. However, `pinthreads_mpi` is currently very bare-bones and doesn't support multi-node scenarios yet. It would be great to have it improved.
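
A minimal sketch of the two routes, assuming the pre-revamp API; the `pinthreads_mpi` call shape is hypothetical, since this issue only states that rank and world size are user-supplied:

```julia
using ThreadPinning

# Route 1: an external affinity mask exists (e.g. SLURM with --cpus-per-task);
# each rank simply pins its Julia threads within the mask it was given.
pinthreads(:affinitymask)

# Route 2: no external mask; the pinning must be derived from MPI information.
# Hypothetical call shape (argument names and order are illustrative only):
# pinthreads_mpi(:numa, rank, nranks)
```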

This is low-priority for me, because I don't really need it.

(cc @sloede)

carstenbauer commented 1 year ago

Thought: Currently we ask the user to provide the world size and rank information via function arguments to `pinthreads_mpi`. The idea was to avoid MPI.jl as a direct dependency. We should try the new (Julia 1.9) weak-dependencies/extensions feature so that TP.jl can use MPI.jl for this part without making it a direct dependency.
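
A minimal sketch of what such an extension could look like, assuming Julia ≥ 1.9; the module name `MPIExt` and the convenience method are hypothetical, not the package's actual API:

```julia
# Project.toml (sketch):
#
#   [weakdeps]
#   MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
#
#   [extensions]
#   MPIExt = "MPI"

# ext/MPIExt.jl -- loaded automatically only when the user also loads MPI.jl.
module MPIExt

using ThreadPinning, MPI

# Hypothetical convenience method: derive rank and world size from the
# communicator instead of requiring the user to pass them explicitly.
function ThreadPinning.pinthreads_mpi(strategy::Symbol; comm = MPI.COMM_WORLD, kwargs...)
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    return ThreadPinning.pinthreads_mpi(strategy, rank, nranks; kwargs...)
end

end # module
```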

carstenbauer commented 1 year ago

As came up over at Discourse, a feature like this (manually pinning threads of separate processes) can also be very useful when using the Distributed stdlib instead of MPI.
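
A hypothetical sketch of the Distributed analogue: the master process can query every worker's hostname directly and hand out per-node local ranks itself (all names below are illustrative):

```julia
using Distributed
addprocs(2)
@everywhere using Sockets  # for gethostname

# Ask each worker for its hostname, then assign 0-based local ranks per node.
hosts = Dict(w => remotecall_fetch(gethostname, w) for w in workers())
localrank = Dict{Int,Int}()
for h in unique(values(hosts))
    for (i, w) in enumerate(sort([w for (w, hw) in hosts if hw == h]))
        localrank[w] = i - 1
    end
end
```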

jagot commented 1 month ago

While playing around with this today, I found that figuring out the local rank as suggested in #51, i.e. taking the global rank modulo the number of processes per node (which you can provide as a user input), does not always work. I ran into a situation where I was running two processes per node and MPI decided to place all even ranks on one node and all odd ranks on another; the modulus then assigns the same local rank to multiple processes on the same node.
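
A toy illustration of that failure mode (plain Julia, no MPI needed):

```julia
# 4 ranks, 2 processes per node, but MPI placed the even ranks on node "a"
# and the odd ranks on node "b".
node = Dict(0 => "a", 1 => "b", 2 => "a", 3 => "b")
heuristic = Dict(r => r % 2 for r in 0:3)  # global rank mod processes-per-node
# Ranks 0 and 2 share node "a", yet the heuristic gives both "local rank" 0.
@assert heuristic[0] == heuristic[2] == 0
```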

Instead, I came up with the following function, which sends each process's hostname to rank zero; rank zero then replies with a unique local rank to every process (including itself, of course):

```julia
using MPI
using Sockets: gethostname

# Determine this process's node-local rank by gathering all hostnames on
# rank 0, which assigns unique per-node local ranks and sends them back.
# Returns (local_rank, alone), where `alone` is true iff every rank sits
# on its own node.
function get_local_rank(comm = MPI.COMM_WORLD)
    my_rank = MPI.Comm_rank(comm)
    if my_rank == 0
        num_ranks = MPI.Comm_size(comm)
        hostnames = Vector{String}(undef, num_ranks)
        hostnames[1] = gethostname()
        my_local_rank = -1

        for i = 2:num_ranks
            @info "Waiting on hostname from rank $(i-1)"
            hostnames[i] = MPI.recv(comm, source = i - 1)
        end

        @info "All hostnames" hostnames
        uhostnames = unique(hostnames)
        alone = length(uhostnames) == num_ranks

        # Within each node, the enumeration index (minus one) of its ranks
        # is the local rank.
        for n in uhostnames
            for (i, j) in enumerate(findall(==(n), hostnames))
                if j == 1
                    # Index 1 is rank 0 itself: record instead of sending.
                    my_local_rank = i - 1
                    continue
                end
                MPI.send((i - 1, alone), comm, dest = j - 1)
            end
        end
        my_local_rank, alone
    else
        # Send our hostname to rank 0.
        MPI.send(gethostname(), comm, dest = 0)
        # Wait for rank 0 to send back our local rank and whether we are
        # alone on this node.
        MPI.recv(comm, source = 0)
    end
end
```
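
For context, a sketch of how the result could feed into pinning; the block-partitioning scheme below is my own assumption (homogeneous nodes, one contiguous CPU-thread block per process), not part of the function above:

```julia
using MPI, ThreadPinning

MPI.Init()
local_rank, _ = get_local_rank()

# Give each process on a node a contiguous, non-overlapping block of
# CPU-threads, sized by its Julia thread count.
nt = Threads.nthreads()
pinthreads(local_rank * nt : (local_rank + 1) * nt - 1)
```
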
carstenbauer commented 1 month ago

Thanks! Once I'm done with the rewrite (cb/revamp), we should restart the effort to add MPI (and Distributed) support via extensions (old effort: https://github.com/carstenbauer/ThreadPinning.jl/pull/64).

carstenbauer commented 1 month ago

@jagot The revamp has landed on the main branch. Please try it out. I might have a little bit of bandwidth left to try to work on the MPI integration. We'll see.