easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0

Better handling of `RPATH` for Binary installations #3857

Open casparvl opened 3 years ago

casparvl commented 3 years ago

The Issue

When building binaries and shared libraries with EasyBuild with RPATH support enabled (--rpath), the EasyBuild RPATH wrappers inject an RPATH into the binary/shared library header. However, for binary installations, nothing is currently done to set an RPATH.

The way this is currently 'handled' in EasyBuild is that, for example, the Binary EasyBlock just skips the RPATH sanity check. That means the installation will complete successfully, but if the site has EasyBuild configured to filter LD_LIBRARY_PATH from the modulefiles (a fair approach when you set RPATH), your binary won't work. Plus, it's not really the behaviour you might expect: you configured EasyBuild with RPATH support, yet you get binaries that aren't RPATH-ed.

This issue is also problematic for projects such as EESSI and (I guess) the Compute Canada software stacks, which rely heavily on RPATH in order to pick up libraries from the right place, and not from the OS.

Right now, at our site, we sometimes explicitly include patchelf as a build dependency, and execute a patchelf command using postinstallcmds to inject the RPATH (see e.g. https://github.com/easybuilders/easybuild-easyconfigs/pull/13716 ). However, this is bad in two ways: it means you customize this at the level of the EasyConfig, which means sites with/without RPATH support would need different EasyConfigs. That doesn't make sense. It should be done at a higher level, e.g. the easyblock should check whether RPATH support is enabled, and if so, inject it.

Proposed solution

I propose we implement some mechanism for EasyBuild to set the RPATH in a generic way for binary installations. As discussed with @boegel , the following approach probably makes most sense for that:

akesandgren commented 3 years ago

This is not so easy for things like ANSYS and other really large binary installs.

casparvl commented 3 years ago

True, but if it were easy, would it be rewarding? ;-) Plus, we now have nothing. It can only be an improvement, right?

We don't have to call this function in the ANSYS easyblock. Or we make it another option: --rpath-binaries, so that you can switch it on for cases where you may want it (simple binary installs) and maybe not for complicated cases (ANSYS)?

The generic function to set RPATHs will need to do the following:

  1. Locate all binaries and shared libraries that require an RPATH to be set.
  2. Decide which RPATH to set.
  3. Locate a usable patchelf.
  4. Set the RPATH.
  5. (optional) Strip any unneeded part of the RPATH with patchelf --shrink-rpath.
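
The five steps above could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not EasyBuild code: the names `is_elf`, `find_elf_files` and `set_rpath` are hypothetical, detection uses the ELF magic bytes instead of `file` (so it would also match statically linked binaries, which a real implementation would have to filter out), and patchelf is assumed to be available in `PATH`.

```python
import os
import shutil
import subprocess

ELF_MAGIC = b"\x7fELF"


def is_elf(path):
    """Return True if path is a regular (non-symlink) file starting with the ELF magic bytes."""
    if os.path.islink(path) or not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        return handle.read(4) == ELF_MAGIC


def find_elf_files(installdir, subdirs=("bin", "lib", "lib64")):
    """Step 1: collect ELF files found directly in the given subdirectories."""
    found = []
    for subdir in subdirs:
        dirpath = os.path.join(installdir, subdir)
        if os.path.isdir(dirpath):
            for name in sorted(os.listdir(dirpath)):
                path = os.path.join(dirpath, name)
                if is_elf(path):
                    found.append(path)
    return found


def set_rpath(installdir, rpath_dirs, shrink=True):
    """Steps 3-5: locate patchelf, set the RPATH, optionally shrink it.

    Step 2 (deciding which directories to use) is left to the caller.
    """
    patchelf = shutil.which("patchelf")
    if patchelf is None:
        raise RuntimeError("patchelf not found in PATH; load a patchelf module first")
    rpath = ":".join(rpath_dirs)
    for path in find_elf_files(installdir):
        # arguments are passed as a list, so no shell expansion takes place
        subprocess.run([patchelf, "--force-rpath", "--set-rpath", rpath, path], check=True)
        if shrink:
            subprocess.run([patchelf, "--force-rpath", "--shrink-rpath", path], check=True)
```

Symlinks are skipped on purpose, so a library that is reachable through several version symlinks is only patched once.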

Some things to discuss/think about:

  1. Since the RPATH sanity check also does something like this, I guess we can reuse part of the logic from there.
  2. We could simply set all paths in LD_LIBRARY_PATH, and potentially something like $ORIGIN and a few other relative dirs. Setting everything in LD_LIBRARY_PATH could be a lot, but we can later get rid of unnecessary components in step 5.
  3. Do we rely on a system patchelf? We can always print an error if it doesn't exist and suggest that the user load a module and pass it via --allow-loaded-modules. Alternatively, we would need logic that automatically injects patchelf as a build dependency if --rpath is enabled (but: which version?). To keep things simple, I think we should start with the first approach: rely on a system patchelf (or one loaded by the user and allowed).

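
For point 2, building the candidate RPATH could look something like the sketch below. The function name `candidate_rpath`, the choice of relative `$ORIGIN` directories, and putting those before the `LD_LIBRARY_PATH` entries are all assumptions made for illustration.

```python
import os


def candidate_rpath(ld_library_path, extra=("$ORIGIN", "$ORIGIN/../lib", "$ORIGIN/../lib64")):
    """Build a colon-separated RPATH: relative $ORIGIN entries first, then LD_LIBRARY_PATH dirs."""
    entries = []
    for entry in list(extra) + (ld_library_path or "").split(":"):
        # skip empty components (these would refer to the current directory) and duplicates
        if entry and entry not in entries:
            entries.append(entry)
    return ":".join(entries)


if __name__ == "__main__":
    print(candidate_rpath(os.environ.get("LD_LIBRARY_PATH")))
```

One thing to keep in mind: `$ORIGIN` has to reach patchelf literally, so the command should either be run without a shell (argument list) or with the value properly quoted.
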
casparvl commented 3 years ago

As an example, I created an EasyBlock & EasyConfig for NCCL:

# nccl.py
import os

from easybuild.easyblocks.generic.makecp import MakeCp
from easybuild.framework.easyblock import DEFAULT_BIN_LIB_SUBDIRS
from easybuild.tools.run import run_cmd

class EB_NCCL(MakeCp):
    """Support for building NCCL"""

#    def __init__(self, *args, **kwargs):
#        """Initialize NCCL specific variables"""

# TODO:
# - Filter RPATHS for things like cuda stubs, just like RPATH wrappers
# - Make running this post_install_step conditional on whether or not --rpath is set
# - Invoke super's post_install_step
    def post_install_step(self, rpath_dirs=None):
        """Set RPATHs"""

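        ld_lib_path = os.getenv('LD_LIBRARY_PATH')
        self.log.debug("LD_LIBRARY_PATH = %s" % ld_lib_path)
        if not ld_lib_path:
            # nothing to inject; avoids setting the literal string 'None' as RPATH below
            self.log.info("LD_LIBRARY_PATH is not set, so there is no RPATH to inject")
            return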

        if rpath_dirs is None:
            rpath_dirs = self.cfg['bin_lib_subdirs'] or self.bin_lib_subdirs()

        if not rpath_dirs:
            rpath_dirs = DEFAULT_BIN_LIB_SUBDIRS
            self.log.info("Using default subdirectories for binaries/libraries to set RPATH for: %s",
                          rpath_dirs)
        else:
            self.log.info("Using specified subdirectories for binaries/libraries to set RPATH for: %s",
                          rpath_dirs)

        for dirpath in [os.path.join(self.installdir, d) for d in rpath_dirs]:
            if os.path.exists(dirpath):
                self.log.debug("Setting RPATH for files in %s", dirpath)

                for path in [os.path.join(dirpath, x) for x in os.listdir(dirpath)]:
                    self.log.debug("Setting RPATH for %s", path)

                    out, ec = run_cmd("file %s" % path, simple=False, trace=False)
                    if ec:
                        # 'file' failed: log a warning and skip this path
                        self.log.warning("Failed to run 'file %s': %s", path, out)
                        continue

                    # only run patchelf on dynamically linked executables/libraries
                    # example output:
                    # ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), ...
                    # ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped
                    if "dynamically linked" in out:
                        self.log.debug("%s is a dynamically linked file. Let's try to set RPATH" % path)
                        out, ec = run_cmd("patchelf --print-rpath %s" % path)
                        self.log.debug("Old RPATH: %s" % out)
                        # TODO: if old RPATH is not empty, print error, or warning? Or append it to the ld_lib_path before setting it?
                        out, ec = run_cmd("patchelf --force-rpath --set-rpath %s %s" % (ld_lib_path, path))
                        # Remove unneeded RPATH entries:
                        out, ec = run_cmd("patchelf --force-rpath --shrink-rpath %s" % path)
                    else:
                        self.log.debug("%s is not dynamically linked, so skipping setting of RPATH", path)
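
The first TODO item (filtering out entries such as CUDA stubs before setting the RPATH) could be approached along these lines. This is only a sketch: `filter_rpath_entries` is a hypothetical helper, and the default patterns are assumptions for illustration, not necessarily what the EasyBuild RPATH wrappers use.

```python
import re

# Hypothetical default patterns: system library locations and CUDA stub directories.
DEFAULT_FILTER_PATTERNS = (r"/lib(64)?(/.*)?", r"/usr(/.*)?", r".*/stubs/?")


def filter_rpath_entries(entries, patterns=DEFAULT_FILTER_PATTERNS):
    """Drop empty entries and entries that fully match any of the given regular expressions."""
    regex = re.compile("|".join("(?:%s)" % pattern for pattern in patterns))
    return [entry for entry in entries if entry and not regex.fullmatch(entry)]


if __name__ == "__main__":
    example = "/sw/GCCcore/10.3.0/lib64:/usr/lib:/sw/CUDA/11.3.1/lib64/stubs"
    print(":".join(filter_rpath_entries(example.split(":"))))
```

In the easyblock above, this would be applied to the components of ld_lib_path before they are passed to patchelf --set-rpath.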

And as EasyConfig, I took NCCL-2.9.9-CUDA-11.3.0.eb from the repo, bumped it to the GCCcore instead of system level, and removed the easyblock = MakeCp so that it now uses my nccl.py EasyBlock.

# NCCL-2.9.9-GCCcore-10.3.0-CUDA-11.3.0.eb

name = 'NCCL'
version = '2.9.9'
local_cuda_version = '11.3.1'
versionsuffix = '-CUDA-%s' % local_cuda_version

homepage = 'https://developer.nvidia.com/nccl'
description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
communication primitives that are performance optimized for NVIDIA GPUs."""

toolchain = {'name': 'GCCcore', 'version': '10.3.0'}

local_cuda_version_major_minor = '.'.join(local_cuda_version.split('.')[:2])

# Download from https://developer.nvidia.com/nccl/nccl-download (after log in)
sources = ['%%(namelower)s_%%(version)s-1+cuda%s_%%(arch)s.txz' % local_cuda_version_major_minor]
checksums = [
    {
        '%%(namelower)s_%%(version)s-1+cuda%s_aarch64.txz' % local_cuda_version_major_minor:
            'b3291328990785807ba01d20a730bed86fbf84d28194da7e8194c9503d0eaefe',
        '%%(namelower)s_%%(version)s-1+cuda%s_ppc64le.txz' % local_cuda_version_major_minor:
            '11dec74b9397c100588dcc96a8c23ffc89567523d47047bcf5d12f625c1c13b6',
        '%%(namelower)s_%%(version)s-1+cuda%s_x86_64.txz' % local_cuda_version_major_minor:
            '59df720e039fd8a765ceb94749e0a8ca9d7bb3dd9bec4867f500e8f0325262c8',
    }
]

dependencies = [('CUDA', local_cuda_version, '', True)]

skipsteps = ['build']

files_to_copy = ['lib', 'include']

sanity_check_paths = {
    'files': ['lib/libnccl.%s' % SHLIB_EXT, 'lib/libnccl_static.a', 'include/nccl.h'],
    'dirs': ['include'],
}

moduleclass = 'lib'

Whereas this would previously fail in the RPATH sanity check, it now passes. Some sample output from the RPATH sanity check:

== 2021-10-13 17:06:10,674 run.py:233 INFO running cmd: ldd /home/casparl/.local/easybuild/Debian10/2021/software/NCCL/2.9.9-GCCcore-10.3.0-CUDA-11.3.1/lib/libnccl.so.2.9.9
== 2021-10-13 17:06:10,727 run.py:584 DEBUG cmd "ldd /home/casparl/.local/easybuild/Debian10/2021/software/NCCL/2.9.9-GCCcore-10.3.0-CUDA-11.3.1/lib/libnccl.so.2.9.9" exited with exit code 0 and output:
        linux-vdso.so.1 (0x00007ffc4afc6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000014623d60e000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000014623d604000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000014623d5ff000)
        libstdc++.so.6 => /sw/arch/Debian10/EB_production/2021/software/GCCcore/10.3.0/lib64/libstdc++.so.6 (0x000014623d418000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014623d293000)
        libgcc_s.so.1 => /sw/arch/Debian10/EB_production/2021/software/GCCcore/10.3.0/lib64/libgcc_s.so.1 (0x000014623d279000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014623d0b8000)
        /lib64/ld-linux-x86-64.so.2 (0x0000146244f61000)
...
== 2021-10-13 17:06:10,757 run.py:584 DEBUG cmd "readelf -d /home/casparl/.local/easybuild/Debian10/2021/software/NCCL/2.9.9-GCCcore-10.3.0-CUDA-11.3.1/lib/libnccl.so.2.9.9" exited with exit code 0 and output:

Dynamic section at offset 0x7def000 contains 33 entries:
  Tag        Type                         Name/Value
 0x000000000000000f (RPATH)              Library rpath: [/sw/arch/Debian10/EB_production/2021/software/GCCcore/10.3.0/lib64]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [libnccl.so.2]
...

Just as I would expect: it now picks up the libstdc++ and libgcc_s from GCCcore/10.3.0.
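
To check this kind of output programmatically, the RPATH entries can be pulled out of the `readelf -d` output with a small parser. The following is a sketch with a hypothetical `parse_rpath` helper; it is not the code used by the EasyBuild sanity check.

```python
import re

# Matches both RPATH and RUNPATH entries in the dynamic section listing
RPATH_REGEX = re.compile(r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]")


def parse_rpath(readelf_output):
    """Return the RPATH/RUNPATH directories listed in `readelf -d` output."""
    dirs = []
    for line in readelf_output.splitlines():
        match = RPATH_REGEX.search(line)
        if match:
            # an entry may hold several colon-separated directories
            dirs.extend(d for d in match.group(2).split(":") if d)
    return dirs
```

Fed with the output shown above, this should yield a single entry, the GCCcore/10.3.0 lib64 directory.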

Note that a build against the system toolchain would still fail, since the sanity check explicitly checks for an RPATH entry in the ELF header. Thus, a binary that only requires system libs (such as an NCCL installed at system level) would still fail the current RPATH sanity check. We should probably decide that in some cases, missing an RPATH entry in the ELF header is not a reason to fail the sanity check as long as ldd shows no missing libraries, but I'd need to think about the scenarios where that makes sense.
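
A relaxed check along those lines might look like the sketch below. Both function names and the `allow_no_rpath` switch are hypothetical; when it would be safe to enable that switch is exactly the open question.

```python
def missing_libraries(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def rpath_sanity_ok(ldd_output, has_rpath, allow_no_rpath=False):
    """Fail on any missing library; a missing RPATH entry only fails if not explicitly allowed."""
    if missing_libraries(ldd_output):
        return False
    return has_rpath or allow_no_rpath
```

With `allow_no_rpath=False` this behaves like the strict check described above: no missing libraries and an RPATH entry are both required.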

Another thing is that I now put this logic in an EasyBlock. It would be much nicer if we could separate that into a single function, so that one would only have to include something like

    def post_install_step(self, rpath_dirs=None):
        some_generic_functions_setting_rpath(rpath_dirs)