EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

Ability to inject host site versions of libfabric/UCX #252

Open ocaisa opened 1 year ago

ocaisa commented 1 year ago

We were looking into the case of the EFA fabric at AWS. What we provide in EESSI works with the fabric, but it is true that you get better performance with the libfabric version that they ship with the OS (Amazon Linux 2 in the case we investigated).

You can check this with:

LD_PRELOAD="/opt/amazon/efa/lib64/libfabric.so.1 /lib64/libefa.so.1" mpirun --mca pml cm osu_bibw

and compare that to

mpirun --mca pml cm osu_bibw

As things stand, we've only built in capabilities to switch out the MPI library, but it may be better/easier to switch out the UCX/libfabric libraries.
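A minimal sketch of that comparison as a script. The function names `build_preload` and `compare_osu_bibw` are hypothetical; the library paths are the Amazon Linux 2 ones quoted above, and the sketch assumes `mpirun` and `osu_bibw` are on `PATH`.

```shell
# Hypothetical helper: join the libraries that actually exist on this host
# into a space-separated LD_PRELOAD value, skipping any that are missing.
build_preload() {
    out=""
    for lib in "$@"; do
        if [ -e "$lib" ]; then
            out="${out:+$out }$lib"
        fi
    done
    printf '%s\n' "$out"
}

# Hypothetical wrapper: run the benchmark once with the EESSI libfabric and,
# if the host libraries were found, once more with them preloaded.
compare_osu_bibw() {
    preload=$(build_preload /opt/amazon/efa/lib64/libfabric.so.1 /lib64/libefa.so.1)
    echo "== EESSI libfabric =="
    mpirun --mca pml cm osu_bibw || return 1
    if [ -n "$preload" ]; then
        echo "== host libfabric =="
        LD_PRELOAD="$preload" mpirun --mca pml cm osu_bibw
    else
        echo "host libfabric/libefa not found, skipping second run" >&2
    fi
}
```

Calling `compare_osu_bibw` inside an EESSI session prints both benchmark tables one after the other for a side-by-side read.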

ocaisa commented 1 year ago

The issue with making those libraries available is that we don't control the ELF header of (an injected) libfabric.so.1, and that is the library that tries to load libefa.so.1. That means we can override the link to libfabric.so.1 but not the one to libefa.so.1.

The only thing I can think of right now is to point LD_LIBRARY_PATH at the same location as the overrides and keep a copy of the necessary library/libraries there. This sounds like another good reason to have the init scripts be a symlink.
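The LD_LIBRARY_PATH idea could look roughly like the sketch below. `stage_overrides` and the destination directory are hypothetical names, not something EESSI provides. Whether the dynamic linker consults LD_LIBRARY_PATH before other search paths depends on how the loading library was linked (DT_RPATH is searched before it, DT_RUNPATH after it), so treat this as something to try, not a guaranteed fix.

```shell
# Hypothetical helper: copy host libraries into one override directory and
# print an LD_LIBRARY_PATH value with that directory in front.
stage_overrides() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for lib in "$@"; do
        # -L dereferences symlinks so a real file lands in the override dir
        cp -L "$lib" "$dest/" || return 1
    done
    printf '%s\n' "$dest${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Example use (the destination path is a placeholder):
#   export LD_LIBRARY_PATH=$(stage_overrides /path/to/overrides /lib64/libefa.so.1)
```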

ocaisa commented 1 year ago

One solution may be to use the Gentoo Prefix equivalent of /etc/ld.so.preload (not ideal though, as these libraries are preloaded for everything that uses the prefix linker).

ocaisa commented 1 year ago

Another option would be to ask AWS to build libfabric with RUNPATH support for /usr/lib{64}. It may sound weird but it would mean the host libefa.so.1 would be picked up before the one we ship in the compat layer.
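To see whether a given libfabric build already carries a RUNPATH, the dynamic section can be inspected with `readelf -d`. The filter below is a small sketch; `dynamic_search_paths` is a hypothetical name, and it only reformats the lines that readelf already prints.

```shell
# Hypothetical helper: read `readelf -d` output on stdin and print any
# RUNPATH or legacy RPATH entries as "<TAG> <path list>".
dynamic_search_paths() {
    sed -n \
        -e 's/.*(RUNPATH).*\[\(.*\)].*/RUNPATH \1/p' \
        -e 's/.*(RPATH).*\[\(.*\)].*/RPATH \1/p'
}

# Example use on the host library discussed in this thread:
#   readelf -d /opt/amazon/efa/lib64/libfabric.so.1 | dynamic_search_paths
```

No output means the library has neither tag, so its dependencies are resolved through the usual search order.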

ocaisa commented 1 year ago

In older EESSI versions we saw some performance issues with GROMACS when injecting libfabric. With the latest EESSI release (2023.06), these issues no longer exist, and we can inject both libfabric and the AWS-provided OpenMPI (4.1.5, as opposed to the 4.1.1 used by the module GROMACS/2021.3-foss-2021a):

[EESSI pilot 2023.06] $ mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
                 (ns/day)    (hour/ns)
Performance:       55.402        0.433

[EESSI pilot 2023.06] $ LD_PRELOAD=/opt/amazon/efa/lib64/libfabric.so.1:/lib64/libefa.so.1 mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
                 (ns/day)    (hour/ns)
Performance:       57.946        0.414

[EESSI pilot 2023.06] $ LD_PRELOAD="/opt/amazon/openmpi/lib64/libmpi.so.40 /lib64/libhwloc.so.5 /lib64/libevent_core-2.0.so.5 /opt/amazon/openmpi/lib64/libopen-rte.so.40 /opt/amazon/openmpi/lib64/libopen-pal.so.40 /lib64/libnl-3.so.200 /lib64/libnl-route-3.so.200 /lib64/libevent_pthreads-2.0.so.5 /opt/amazon/efa/lib64/libfabric.so.1 /lib64/libefa.so.1" mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
...
                 (ns/day)    (hour/ns)
Performance:       58.356        0.411

So, by injecting the libraries we get about a 5% performance improvement.
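For reference, the relative gain can be worked out from the ns/day figures above. `speedup_pct` is a hypothetical helper name; the numbers are the ones reported in the three runs.

```shell
# Relative speed-up in percent, given baseline and new ns/day figures.
speedup_pct() {
    LC_ALL=C awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

speedup_pct 55.402 57.946   # host libfabric only
speedup_pct 55.402 58.356   # host Open MPI plus host libfabric
```

This gives roughly 4.6% for libfabric alone and 5.3% with the AWS Open MPI injected as well.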

ocaisa commented 1 year ago

LD_PRELOAD is clumsy and I would prefer to figure out another way to do the injection. I wonder if I can use patchelf to rewrite the ELF header of the AWS MPI library so that it finds all its required libraries. I tried setting the rpath header, but this worked too well: it injected more libraries than I actually wanted. I may need to replace the .so dependencies with their full paths.

ocaisa commented 1 year ago

patchelf does indeed seem to provide a way forward:

# Force libmpi to resolve unavailable libraries to the system versions
cp /opt/amazon/openmpi/lib64/libmpi.so.40 .
patchelf --replace-needed libhwloc.so.5 /lib64/libhwloc.so.5 libmpi.so.40
patchelf --replace-needed libevent_core-2.0.so.5 /lib64/libevent_core-2.0.so.5 libmpi.so.40
patchelf --replace-needed libevent_pthreads-2.0.so.5 /lib64/libevent_pthreads-2.0.so.5 libmpi.so.40
# Do the same for libfabric (I'm forcing use of system `libefa` here) 
cp /opt/amazon/efa/lib64/libfabric.so.1 .
patchelf --replace-needed libefa.so.1 /lib64/libefa.so.1 libfabric.so.1 
patchelf --add-needed /lib64/libnl-route-3.so.200 libfabric.so.1  # Required by system libefa
patchelf --add-needed /lib64/libnl-3.so.200 libfabric.so.1        # Required by system libefa
# Make libmpi depend on the patched libfabric thereby effectively forcing a preload of that lib
patchelf --add-needed $PWD/libfabric.so.1 libmpi.so.40 

With those changes I was able to run

LD_PRELOAD="$PWD/libmpi.so.40" mpirun --mca pml cm gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile

This means if the modified libmpi is found first then we should be fully replacing the MPI/libfabric of the application.
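One way to double-check the patched libraries is to list their DT_NEEDED entries with `patchelf --print-needed` and look for anything that is still a bare soname. The filter name `unpinned_needed` is hypothetical.

```shell
# Hypothetical helper: from DT_NEEDED entries on stdin (one per line),
# print those that are bare sonames, not absolute paths.
unpinned_needed() {
    grep -v '^/' || true   # no matches is not an error here
}

# Example use on the patched copies in the current directory:
#   patchelf --print-needed libmpi.so.40 | unpinned_needed
#   patchelf --print-needed libfabric.so.1 | unpinned_needed
```

Entries that show up here are still resolved through the normal search order, so they may come from the compat layer and not from the host.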

Thanks to our EasyBuild hook for rpath, we don't need LD_PRELOAD to make the executable find the library:

mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/2023.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib
mv libfabric.so.1 libmpi.so.40 /cvmfs/pilot.eessi-hpc.org/host_injections/2023.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib

and we can then see that EESSI resolves libmpi to the injected library with ldd $(which gmx_mpi).
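A small sketch for pulling just the relevant line out of the ldd output. `resolved_path` is a hypothetical helper, and it assumes the usual `soname => path (address)` layout that ldd prints.

```shell
# Hypothetical helper: given a soname as $1 and ldd output on stdin,
# print the path that soname resolved to.
resolved_path() {
    awk -v so="$1" '$1 == so && $2 == "=>" { print $3 }'
}

# Example use:
#   ldd "$(which gmx_mpi)" | resolved_path libmpi.so.40
```

If the injection works, the printed path should sit under the rpath_overrides directory created above.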