ptheywood opened 2 years ago
A user encountered performance issues with a hybrid OpenMP + MPI code.
Using 40 cores with OpenMP alone (no MPI), performance was good:
```
OMP_NUM_THREADS=40 ./myapp
```
When running the same code with both OpenMP and MPI, performance was poor:
```
OMP_NUM_THREADS=40 mpirun -np 1 ./myapp
OMP_NUM_THREADS=20 mpirun -np 2 ./myapp
```
Binding each rank to a socket improved the performance significantly:
```
OMP_NUM_THREADS=20 mpirun -np 2 --bind-to socket ./myapp
```
This may have been influenced by running on the login node; a batch job might not have shown the performance degradation if the scheduler assigned cores from the same socket (although this is not always possible, and if a full node is requested both sockets will be in the allocation).
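For reference, a minimal Slurm batch script sketch of the two-socket layout above (this assumes a Slurm cluster and an MPI build that can be launched with srun; ./myapp is carried over from the commands above, and the resource values are illustrative):

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # one MPI rank per socket on a dual-socket node
#SBATCH --cpus-per-task=20    # 20 cores, and hence 20 OpenMP threads, per rank

# Match the OpenMP thread count to the cores Slurm allocated per rank.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./myapp
```

Whether srun binds each rank to its own socket here depends on the site's Slurm configuration, which ties into the binding discussion below.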
This should be described on the MPI pages of the documentation.
The binding story is potentially affected by launching under Slurm. The default binding for plain OpenMPI is documented in mpirun(1) (and can be checked with --report-bindings). Experimentally, hydra (MPICH) doesn't apply a default binding (e.g. mpiexec -n 4 hwloc-ps). You should normally bind to the smallest topology level allowed by the threading: core for single-threaded runs, L2 for two threads, and socket for more threads (or preferably to multiples of L2, e.g. --map-by socket:pe=4 for four threads/rank). I don't know whether hydra supports the same binding and mapping features as OpenMPI.
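As a concrete sketch of the above (./myapp as before; the rank and thread counts are illustrative):

```
# Show the bindings OpenMPI applies before the run starts:
mpirun -np 2 --report-bindings ./myapp

# Map one rank per socket with four cores ("processing elements") each,
# suitable for four OpenMP threads per rank:
OMP_NUM_THREADS=4 mpirun -np 2 --map-by socket:pe=4 ./myapp

# Under MPICH's hydra launcher, inspect what binding (if any) was applied:
mpiexec -n 4 hwloc-ps
```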
We investigated GPU locality early on and, as far as I remember, concluded that OpenMPI under Slurm just does the right thing (DTRT).
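If that conclusion needs re-checking, one quick sanity check (assuming NVIDIA GPUs) is the driver's topology matrix, whose CPU Affinity column shows which cores are local to each GPU:

```
nvidia-smi topo -m
```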
Cliff has made some valid suggestions for information that should be included in the documentation re: MPI.
These may be included in the longer-form content, currently being worked on, that provides examples of using MPI with CUDA.