ocaisa opened this issue 3 years ago
I've tested this with other OpenMPI builds from EESSI, and it worked out of the box when using another MPI built on top of the same Gentoo Prefix. When trying to use an MPI from a different prefix layer, I also had to add libdl.so.2 from the compat layer:
system/lib:
total 0
lrwxrwxrwx. 1 ocaisa1 ocaisa1 78 Jul 2 12:16 libdl.so.2 -> /cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib/../lib64/libdl.so.2
lrwxrwxrwx. 1 ocaisa1 ocaisa1 115 Jul 2 12:15 libmpi.so.40 -> /cvmfs/pilot.eessi-hpc.org/2021.03/software/linux/x86_64/amd/zen2/software/OpenMPI/4.0.3-GCC-9.3.0/lib/libmpi.so.40
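A symlink like that can be created with something along the lines of (run from within the rpath_overrides/OpenMPI directory):

ln -s /cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib64/libdl.so.2 system/lib/libdl.so.2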
When using an injected OpenMPI, if it is rpath-ed you should have no problems. If it is not, then the ldd test will probably indicate some missing libraries. These will also need to be placed in /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib, or you can use LD_LIBRARY_PATH to have them found (but do not put /usr/lib(64) in LD_LIBRARY_PATH, as this will break the compat layer).
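To illustrate the ldd test, with a non-rpath-ed override in place you would see something like this (hypothetical output; the exact missing libraries depend on the host MPI build):

[ocaisa1@node1 ~]$ ldd $(which gmx) | grep "not found"
        libopen-rte.so.40 => not found
        libopen-pal.so.40 => not found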
I tried to get this to work together with Singularity but have not had success yet; advice on how to do this is welcome!
I also used MPI directly from the host (OpenMPI 3, which is ABI-compatible with 4); this also worked, but there were a few warnings (these can be suppressed with OMPI_MCA_mca_base_component_show_load_errors=0).
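For example, set it in the environment you launch from:

export OMPI_MCA_mca_base_component_show_load_errors=0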
On CentOS 7, my directories that made this work looked like:
[ocaisa1@node1 OpenMPI]$ pwd
/cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI
[ocaisa1@node1 OpenMPI]$ ls -l
total 0
drwxrwxr-x. 3 ocaisa1 ocaisa1 17 Jul 2 14:12 OpenMPI_eessi
drwxrwxr-x. 3 ocaisa1 ocaisa1 17 Jul 2 14:13 OpenMPI_host
lrwxrwxrwx. 1 ocaisa1 ocaisa1 12 Jul 2 14:14 system -> OpenMPI_host
[ocaisa1@node1 OpenMPI]$ ls -l OpenMPI_*/lib
OpenMPI_eessi/lib:
total 0
lrwxrwxrwx. 1 ocaisa1 ocaisa1 78 Jul 2 12:16 libdl.so.2 -> /cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib/../lib64/libdl.so.2
lrwxrwxrwx. 1 ocaisa1 ocaisa1 115 Jul 2 12:15 libmpi.so.40 -> /cvmfs/pilot.eessi-hpc.org/2021.03/software/linux/x86_64/amd/zen2/software/OpenMPI/4.0.3-GCC-9.3.0/lib/libmpi.so.40
OpenMPI_host/lib:
total 0
lrwxrwxrwx. 1 ocaisa1 ocaisa1 24 Jul 2 12:26 libhwloc.so.5 -> /usr/lib64/libhwloc.so.5
lrwxrwxrwx. 1 ocaisa1 ocaisa1 36 Jul 2 12:25 libmpi.so.40 -> /usr/lib64/openmpi3/lib/libmpi.so.40
Both of these tests had rpath-ed OpenMPI builds; without RPATH you would need to add additional libraries (libopen-rte.so.40 and libopen-pal.so.40 are the minimum, I think... or you just use LD_LIBRARY_PATH).
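A sketch for the host-MPI case above, assuming those libraries sit next to libmpi.so.40 (as for the CentOS 7 openmpi3 package):

[ocaisa1@node1 OpenMPI]$ ln -s /usr/lib64/openmpi3/lib/libopen-rte.so.40 OpenMPI_host/lib/libopen-rte.so.40
[ocaisa1@node1 OpenMPI]$ ln -s /usr/lib64/openmpi3/lib/libopen-pal.so.40 OpenMPI_host/lib/libopen-pal.so.40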
I also tried something similar to this to test overriding MPI on AWS Skylake with EFA (using LD_PRELOAD to force picking up my provided libraries, as we don't have a 2021.06 stack for this yet). There is no real performance difference (minimum latency of about 17 microseconds, maximum point-to-point bandwidth of about 9000 MB/s); however, it is clear that there are cases where this may not be perfect:
Program: gmx mdrun, version 2020.4-MODIFIED
Source file: src/gromacs/hardware/hardwaretopology.cpp (line 614)
Function: gmx::{anonymous}::parseHwLoc(gmx::HardwareTopology::Machine*, gmx::HardwareTopology::SupportLevel*, bool*)::<lambda()>
MPI rank: 3 (out of 48)
Assertion failed:
Condition: (hwloc_get_api_version() >= 0x20000)
Mismatch between hwloc headers and library, using v2 headers with v1 library
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
The problem is that the hwloc version needed by the AWS OpenMPI is not API-compatible with the one used by GROMACS. Hopefully a corner case...
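For reference, the LD_PRELOAD override amounts to something along the lines of (the library path here is an assumption; it depends on where the EFA-enabled OpenMPI lives on the image):

export LD_PRELOAD=/opt/amazon/openmpi/lib64/libmpi.so.40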
Regarding hwloc, it would be wise for us to inspect the ABI of the version that gets pulled in with MPI and check its compatibility with the version that EESSI uses (https://www.open-mpi.org/projects/hwloc/doc/v2.4.0/a00364.php#faq_version_abi). The issue is likely to arise with older underlying OSes (like CentOS 7); we don't need to fail out (since this would only affect packages that rely on MPI and have a hwloc dependency), but we should probably print a warning that this issue may arise.
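A quick way to check is the SONAME of the hwloc that comes in with the host MPI: hwloc 1.x ships libhwloc.so.5 while hwloc 2.x ships libhwloc.so.15 (the path below assumes the CentOS 7 case from earlier):

[ocaisa1@node1 ~]$ readelf -d /usr/lib64/libhwloc.so.5 | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [libhwloc.so.5]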
I've been doing some successful testing of https://github.com/EESSI/software-layer/pull/116 and I'd like others to also give it a try. For minimal testing, you just need to set up the override directory:
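A minimal sketch, assuming the zen2 paths used in the listings above and a host OpenMPI under /usr/lib64/openmpi3 (adjust both for your site):

mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib
cd /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI
ln -s /usr/lib64/openmpi3/lib/libmpi.so.40 system/lib/libmpi.so.40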
The actual test you can then run is, e.g.:
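For example (assuming the GROMACS installation used in the tests above; if the override is picked up, libmpi.so.40 should resolve to the rpath_overrides path):

source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
module load GROMACS
ldd $(which gmx) | grep libmpi.so.40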