ptheywood opened 2 years ago
A user encountered performance issues with a hybrid OpenMP + MPI code.
Using 40 cores with OpenMP alone (no MPI), performance was good:
```
OMP_NUM_THREADS=40 ./myapp
```
When running the same code with both OpenMP and MPI, performance was poor:
```
OMP_NUM_THREADS=40 mpirun -np 1 ./myapp
OMP_NUM_THREADS=20 mpirun -np 2 ./myapp
```
Binding each rank to a socket improved the performance significantly:
```
OMP_NUM_THREADS=20 mpirun -np 2 --bind-to socket ./myapp
```
This may have been influenced by running on the login node; a batch job might not have shown the performance degradation if the scheduler assigned cores from the same socket (although this is not always possible, and if a full node is requested both sockets will be in the allocation).
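For reference, a minimal Slurm batch script sketch of the two-socket layout above (this assumes a Slurm cluster and an MPI build that can be launched with srun; ./myapp is carried over from the commands above, and the resource values are illustrative):

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # one MPI rank per socket on a dual-socket node
#SBATCH --cpus-per-task=20    # 20 cores, and hence 20 OpenMP threads, per rank

# Match the OpenMP thread count to the cores Slurm allocated per rank.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./myapp
```

Whether srun binds each rank to its own socket here depends on the site's Slurm configuration, which ties into the binding discussion below.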
This should be described on the MPI pages of the documentation.
The binding story is potentially affected by launching under Slurm. The default binding for plain OpenMPI is documented in mpirun(1) (and can be checked with --report-bindings). Experimentally, hydra (MPICH) doesn't apply a default binding (e.g. mpiexec -n 4 hwloc-ps). You should normally bind to the smallest topology level allowed by the threading: core for single-threaded runs, L2 for two threads, and socket for more threads (or preferably to multiples of L2, e.g. --map-by socket:pe=4 for four threads/rank). I don't know whether hydra supports the same binding and mapping features as OpenMPI.
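As a concrete sketch of the above (./myapp as before; the rank and thread counts are illustrative):

```
# Show the bindings OpenMPI applies before the run starts:
mpirun -np 2 --report-bindings ./myapp

# Map one rank per socket with four cores ("processing elements") each,
# suitable for four OpenMP threads per rank:
OMP_NUM_THREADS=4 mpirun -np 2 --map-by socket:pe=4 ./myapp

# Under MPICH's hydra launcher, inspect what binding (if any) was applied:
mpiexec -n 4 hwloc-ps
```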
We investigated GPU locality early on and, as far as I remember, concluded that OpenMPI under Slurm just does the right thing (DTRT).
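If that conclusion needs re-checking, one quick sanity check (assuming NVIDIA GPUs) is the driver's topology matrix, whose CPU Affinity column shows which cores are local to each GPU:

```
nvidia-smi topo -m
```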
Cliff has made some valid suggestions for information that should be included in the documentation re: MPI.
These may be included in the longer-form content, currently being worked on, that provides examples of using MPI with CUDA.