ocaisa opened 1 year ago
The issue with making those libraries available is that we don't control the ELF header of (an injected) `libfabric.so.1`, and that is the one that tries to load `libefa.so.1`. That means we can override the link to `libfabric.so.1`, but not `libefa.so.1`.
The only thing I can think of right now is to use LD_LIBRARY_PATH pointing at the same location as the overrides, and to place a copy of the necessary library/libraries there. This sounds like another good reason to have the init scripts be a symlink.
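A sketch of that workaround, with a hypothetical overrides directory (the temp dir and the Amazon Linux 2 path for `libefa.so.1` are assumptions for illustration, not the actual EESSI layout):

```shell
# Hypothetical overrides directory; in practice this would be the location
# the init scripts already expose for library overrides, not a temp dir.
OVERRIDES=$(mktemp -d)

# Copy the host library that the injected libfabric.so.1 tries to load
# (path as on Amazon Linux 2; skip quietly if it is not present here).
cp /lib64/libefa.so.1 "$OVERRIDES/" 2>/dev/null || true

# Have the dynamic linker search that directory first.
export LD_LIBRARY_PATH="$OVERRIDES${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```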
One solution may be to use the Gentoo Prefix equivalent of /etc/ld.so.preload (not ideal though, as these libraries would be preloaded for everything that uses the prefix linker).
Another option would be to ask AWS to build libfabric with RUNPATH support for `/usr/lib{64}`. It may sound weird, but it would mean the host `libefa.so.1` would be picked up before the one we ship in the compat layer.
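This works because the dynamic linker consults a library's own `DT_RUNPATH` when resolving that library's direct dependencies. Whether a given build carries such an entry can be inspected with `readelf`; a sketch (the path is the one used elsewhere in this issue, adjust as needed):

```shell
# Look for an RPATH/RUNPATH entry in the dynamic section; fall back
# gracefully if the file (or readelf) is not available on this machine.
readelf -d /opt/amazon/efa/lib64/libfabric.so.1 2>/dev/null \
  | grep -E 'RPATH|RUNPATH' \
  || echo "no RUNPATH found (or file missing)"
```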
In the older EESSI versions we saw some performance issues with GROMACS when injecting libfabric. With the latest EESSI release (2023.06), these issues don't exist, and we can inject both libfabric and the AWS-provided OpenMPI (4.1.5, as opposed to the 4.1.1 used by the module `GROMACS/2021.3-foss-2021a`):
```
[EESSI pilot 2023.06] $ mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
               (ns/day)    (hour/ns)
Performance:     55.402        0.433

[EESSI pilot 2023.06] $ LD_PRELOAD=/opt/amazon/efa/lib64/libfabric.so.1:/lib64/libefa.so.1 mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
               (ns/day)    (hour/ns)
Performance:     57.946        0.414

[EESSI pilot 2023.06] $ LD_PRELOAD="/opt/amazon/openmpi/lib64/libmpi.so.40 /lib64/libhwloc.so.5 /lib64/libevent_core-2.0.so.5 /opt/amazon/openmpi/lib64/libopen-rte.so.40 /opt/amazon/openmpi/lib64/libopen-pal.so.40 /lib64/libnl-3.so.200 /lib64/libnl-route-3.so.200 /lib64/libevent_pthreads-2.0.so.5 /opt/amazon/efa/lib64/libfabric.so.1 /lib64/libefa.so.1" mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
               (ns/day)    (hour/ns)
Performance:     58.356        0.411
```
So, by injecting the libraries we get about a 5% performance improvement.
LD_PRELOAD is clumsy, and I would prefer to figure out another way to do the injection. I wonder if I can use `patchelf` to rewrite the ELF header of the AWS MPI library so that it finds all of its required libraries. I tried setting the rpath header, but this worked too well, injecting more libraries than I actually wanted. I may need to replace `.so` dependencies by their full paths.
`patchelf` does indeed seem to provide a way forward:
```shell
# Force libmpi to resolve unavailable libraries to the system versions
cp /opt/amazon/openmpi/lib64/libmpi.so.40 .
patchelf --replace-needed libhwloc.so.5 /lib64/libhwloc.so.5 libmpi.so.40
patchelf --replace-needed libevent_core-2.0.so.5 /lib64/libevent_core-2.0.so.5 libmpi.so.40
patchelf --replace-needed libevent_pthreads-2.0.so.5 /lib64/libevent_pthreads-2.0.so.5 libmpi.so.40

# Do the same for libfabric (I'm forcing use of system libefa here)
cp /opt/amazon/efa/lib64/libfabric.so.1 .
patchelf --replace-needed libefa.so.1 /lib64/libefa.so.1 libfabric.so.1
patchelf --add-needed /lib64/libnl-route-3.so.200 libfabric.so.1  # Required by system libefa
patchelf --add-needed /lib64/libnl-3.so.200 libfabric.so.1        # Required by system libefa

# Make libmpi depend on the patched libfabric, thereby effectively forcing a preload of that lib
patchelf --add-needed $PWD/libfabric.so.1 libmpi.so.40
```
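As a sanity check (my addition, not part of the original recipe), `patchelf --print-needed` lists the `DT_NEEDED` entries one per line, so the replacements can be confirmed before running anything:

```shell
# List the (patched) DT_NEEDED entries of the copied libraries; skip
# gracefully when patchelf or the files are not available on this machine.
for lib in libmpi.so.40 libfabric.so.1; do
  patchelf --print-needed "$lib" 2>/dev/null \
    || echo "skipping $lib (not found here)"
done
```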
With those changes I was able to run:

```
LD_PRELOAD="$PWD/libmpi.so.40" mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
```

This means that if the modified `libmpi` is found first, then we should be fully replacing the MPI/libfabric of the application.
Due to our EasyBuild hook for RPATH, we don't need LD_PRELOAD to force the executable to find the library:
```shell
mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/2023.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib
mv libfabric.so.1 libmpi.so.40 /cvmfs/pilot.eessi-hpc.org/host_injections/2023.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib
```
We can then see that EESSI resolves libmpi to the injected library with `ldd $(which gmx_mpi)`.
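For example (my sketch; the grep filter is an addition, and `gmx_mpi` obviously needs to be on the PATH for this to show anything):

```shell
# Show which libmpi the executable actually resolves to; report
# if gmx_mpi is not available on this machine.
{ bin=$(command -v gmx_mpi) && ldd "$bin" | grep libmpi; } \
  || echo "gmx_mpi not found on PATH"
```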
We were looking into the case of the EFA fabric at AWS. What we provide in EESSI works with the fabric, but it is true that you get better performance with the libfabric version that they ship with the OS (Amazon Linux 2 in the case we investigated).
You can check this with:
and compare that to
As things stand, we've only built in capabilities to switch out the MPI library, but it may be better/easier to switch out the UCX/libfabric libraries.