EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

Testing overriding of the MPI in EESSI #121

Open ocaisa opened 3 years ago

ocaisa commented 3 years ago

I've been doing some successful testing of https://github.com/EESSI/software-layer/pull/116 and I'd like others to give it a try as well. For minimal testing you just need to set up the override directory:

# Make a directory for our overrides
sudo mkdir -p /opt/eessi
# Let's allow working in user space
sudo chown $USER /opt/eessi
# Create the necessary directory structure (/cvmfs/pilot.eessi-hpc.org/host_injections is by default a symlink to /opt/eessi)
mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI
# If the MPI you want to use is loaded as an eb module you can just do
ln -s $EBROOTOPENMPI /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system  
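# If the MPI you want to use is not loaded as an EasyBuild module, you can (as a sketch) point the
# symlink at its installation prefix instead; the path below is a hypothetical example for a host install
ln -s /opt/openmpi/4.0.5 /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system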

The actual test you can run is then (e.g.):

# Check the default linker for the executable can find all the libraries
/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/bin/ldd /cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/OSU-Micro-Benchmarks/5.6.3-gompi-2020a/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
# Run a simple MPI test
mpirun -n 2 /cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/OSU-Micro-Benchmarks/5.6.3-gompi-2020a/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
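# To verify the override is actually picked up, it may help to check which libmpi the executable
# resolves to (a sketch: libmpi.so.40 should resolve via .../rpath_overrides/OpenMPI/system rather
# than the OpenMPI shipped in EESSI)
/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/bin/ldd /cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/OSU-Micro-Benchmarks/5.6.3-gompi-2020a/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw | grep libmpi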
ocaisa commented 3 years ago

I've also tested this with another OpenMPI from EESSI; it worked out of the box when the other MPI was built on top of the same Gentoo Prefix. When using an MPI from a different prefix layer, I additionally had to add the following symlinks:

system/lib:
total 0
lrwxrwxrwx. 1 ocaisa1 ocaisa1  78 Jul  2 12:16 libdl.so.2 -> /cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib/../lib64/libdl.so.2
lrwxrwxrwx. 1 ocaisa1 ocaisa1 115 Jul  2 12:15 libmpi.so.40 -> /cvmfs/pilot.eessi-hpc.org/2021.03/software/linux/x86_64/amd/zen2/software/OpenMPI/4.0.3-GCC-9.3.0/lib/libmpi.so.40

When using an injected OpenMPI, if it is rpath-ed you should have no problems. If it is not, the ldd test will probably report some missing libraries. These also need to be placed (or symlinked) in /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib, or you can use LD_LIBRARY_PATH so they are found (but do not put /usr/lib(64) itself in LD_LIBRARY_PATH, as that will break the compat layer).
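
As a rough sketch of that workflow (the library name and source path below are just examples, and this assumes system points at a writable directory rather than directly into CVMFS):

# See which libraries are still unresolved
/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/bin/ldd /cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/OSU-Micro-Benchmarks/5.6.3-gompi-2020a/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw | grep "not found"
# Symlink each missing library into the override lib directory, e.g.
ln -s /usr/lib64/libhwloc.so.5 /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib/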

I tried to get this to work together with Singularity but have not had success yet; advice on how to do this is welcome!

ocaisa commented 3 years ago

I also used MPI directly from the host (OpenMPI 3, which is ABI-compatible with OpenMPI 4). This also worked, but there were a few warnings (which can be suppressed with OMPI_MCA_mca_base_component_show_load_errors=0).
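
Concretely, that just means setting the environment variable before launching, e.g.:

export OMPI_MCA_mca_base_component_show_load_errors=0
mpirun -n 2 /cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/OSU-Micro-Benchmarks/5.6.3-gompi-2020a/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw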

ocaisa commented 3 years ago

On CentOS 7, the directory layout that made this work looked like this:

[ocaisa1@node1 OpenMPI]$ pwd
/cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI

[ocaisa1@node1 OpenMPI]$ ls -l
total 0
drwxrwxr-x. 3 ocaisa1 ocaisa1 17 Jul  2 14:12 OpenMPI_eessi
drwxrwxr-x. 3 ocaisa1 ocaisa1 17 Jul  2 14:13 OpenMPI_host
lrwxrwxrwx. 1 ocaisa1 ocaisa1 12 Jul  2 14:14 system -> OpenMPI_host

[ocaisa1@node1 OpenMPI]$ ls -l OpenMPI_*/lib
OpenMPI_eessi/lib:
total 0
lrwxrwxrwx. 1 ocaisa1 ocaisa1  78 Jul  2 12:16 libdl.so.2 -> /cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib/../lib64/libdl.so.2
lrwxrwxrwx. 1 ocaisa1 ocaisa1 115 Jul  2 12:15 libmpi.so.40 -> /cvmfs/pilot.eessi-hpc.org/2021.03/software/linux/x86_64/amd/zen2/software/OpenMPI/4.0.3-GCC-9.3.0/lib/libmpi.so.40

OpenMPI_host/lib:
total 0
lrwxrwxrwx. 1 ocaisa1 ocaisa1 24 Jul  2 12:26 libhwloc.so.5 -> /usr/lib64/libhwloc.so.5
lrwxrwxrwx. 1 ocaisa1 ocaisa1 36 Jul  2 12:25 libmpi.so.40 -> /usr/lib64/openmpi3/lib/libmpi.so.40

Both of these tests used rpath-ed OpenMPI builds; without RPATH you would need to add additional libraries (libopen-rte.so.40 and libopen-pal.so.40 are the minimum, I think... or you can just use LD_LIBRARY_PATH).
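
If you go the LD_LIBRARY_PATH route instead of adding symlinks, a minimal sketch (using the host OpenMPI 3 location from the listing above as an example) would be:

# Add only the OpenMPI library directory, not /usr/lib64 itself, so the compat layer stays intact
export LD_LIBRARY_PATH=/usr/lib64/openmpi3/lib:$LD_LIBRARY_PATH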

ocaisa commented 3 years ago

I also tried something similar to test overriding MPI on AWS Skylake with EFA (using LD_PRELOAD to force picking up my provided libraries, as we don't have a 2021.06 stack for this yet); the job scripts are included at the end of this comment. There is no real performance difference (minimum latency of about 17 microseconds, maximum point-to-point bandwidth of about 9000 MB/s). However, it is clear that there are cases where this may not be perfect:

Program:     gmx mdrun, version 2020.4-MODIFIED
Source file: src/gromacs/hardware/hardwaretopology.cpp (line 614)
Function:    gmx::{anonymous}::parseHwLoc(gmx::HardwareTopology::Machine*, gmx::HardwareTopology::SupportLevel*, bool*)::<lambda()>
MPI rank:    3 (out of 48)

Assertion failed:
Condition: (hwloc_get_api_version() >= 0x20000)
Mismatch between hwloc headers and library, using v2 headers with v1 library

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

The problem is that the hwloc version required by the AWS OpenMPI is not API-compatible with the one used by GROMACS. Hopefully this is a corner case...

Job script for latency/bandwidth:

```
#!/bin/bash -x
#SBATCH --time=00:20:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodes=2

module load OSU-Micro-Benchmarks

ldd $(which osu_latency)
mpirun -n 2 osu_latency
mpirun -n 2 osu_bw

export LD_PRELOAD=/opt/amazon/openmpi/lib64/libmpi.so.40:/opt/amazon/openmpi/lib64/libopen-rte.so.40:/opt/amazon/openmpi/lib64/libopen-pal.so.40:/lib64/libhwloc.so.5:/lib64/libevent_core-2.0.so.5:/lib64/libevent_pthreads-2.0.so.5:/lib64/libnl-3.so.200:/lib64/libnl-route-3.so.200

ldd $(which osu_latency)
/opt/amazon/openmpi/bin/mpirun -n 2 osu_latency
/opt/amazon/openmpi/bin/mpirun -n 2 osu_bw
```
Job script for GROMACS test:

```
#!/bin/bash -x
#SBATCH --time=00:20:00
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=2
#SBATCH --nodes=2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load GROMACS

rm logfile.log ener.edr
ldd $(which gmx_mpi)
mpirun -n 48 gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 20000 -g logfile -dlb yes

export LD_PRELOAD=/opt/amazon/openmpi/lib64/libmpi.so.40:/opt/amazon/openmpi/lib64/libopen-rte.so.40:/opt/amazon/openmpi/lib64/libopen-pal.so.40:/lib64/libhwloc.so.5:/lib64/libevent_core-2.0.so.5:/lib64/libevent_pthreads-2.0.so.5:/lib64/libnl-3.so.200:/lib64/libnl-route-3.so.200

ldd $(which gmx_mpi)
rm logfile.log ener.edr
/opt/amazon/openmpi/bin/mpirun -n 48 gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 20000 -g logfile -dlb yes
```
ocaisa commented 3 years ago

Regarding hwloc, it would be wise for us to inspect the ABI of the version that gets pulled in with the MPI and check its compatibility with the version that EESSI uses (https://www.open-mpi.org/projects/hwloc/doc/v2.4.0/a00364.php#faq_version_abi). The issue is most likely to arise with older underlying OSes (like CentOS 7). We don't need to fail out (since this only affects packages that depend on MPI and also have an hwloc dependency), but we should probably emit a warning that this issue may arise.
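
A minimal sketch of such a check (the library paths are examples from the CentOS 7 case above; hwloc 1.x and 2.x ship different sonames, e.g. libhwloc.so.5 vs libhwloc.so.15):

# hwloc soname pulled in by the injected/host MPI library
ldd /usr/lib64/openmpi3/lib/libmpi.so.40 | grep libhwloc
# hwloc soname used by the EESSI software stack (e.g. by GROMACS)
/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/bin/ldd $(which gmx_mpi) | grep libhwloc
# If the major sonames differ, warn that hwloc-dependent MPI applications may break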