Open casparvl opened 3 years ago
This is not so easy for things like ANSYS and other really large binary installs.
True, but if it were easy, would it be rewarding? ;-) Plus, we now have nothing. It can only be an improvement, right?
We don't have to call this function in the ANSYS easyblock. Or we make it another option: --rpath-binaries, so that you can switch it on for cases where you may want it (simple binary installs) and maybe not for complicated cases (ANSYS)?
The generic function to set RPATHs will need to do the following:
1. [...] RPATH to be set.
2. [...] RPATH to set.
3. [...] RPATH
4. [...] RPATH
5. [...] with patchelf --shrink-rpath
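The patchelf part of these steps could be sketched roughly as below. This is only an illustration, not EasyBuild code: `patchelf_commands` is a hypothetical helper name. The patchelf options it uses (`--print-rpath`, `--set-rpath`, `--shrink-rpath`, `--force-rpath`) are the ones used elsewhere in this thread.

```python
def patchelf_commands(path, rpath):
    """Return the patchelf invocations for one file, in the order they would run.

    Hypothetical helper: each command is an argument list, so paths that
    contain spaces do not need shell quoting.
    """
    return [
        # inspect the RPATH that is already present (may be empty)
        ["patchelf", "--print-rpath", path],
        # set the new RPATH; --force-rpath writes DT_RPATH rather than DT_RUNPATH
        ["patchelf", "--force-rpath", "--set-rpath", rpath, path],
        # drop RPATH entries that do not provide any needed library
        ["patchelf", "--force-rpath", "--shrink-rpath", path],
    ]
```

Returning argument lists rather than formatted strings keeps the helper testable without patchelf installed; how the commands are actually executed is left open here.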
Some things to discuss/think about:
- [...] RPATH sanity check also does something like this, I guess we can reuse part of the logic from there.
- [...] LD_LIBRARY_PATH, and potentially something like $ORIGIN and a few other relative dirs. Setting everything in LD_LIBRARY_PATH could be a lot, but we can later get rid of unnecessary components in step 5.
- [...] patchelf? We can always print an error if it doesn't exist and suggest that the user load a module and pass it via --allow-loaded-modules. Alternatively, we would need logic that automatically injects patchelf as build dependency if --rpath is enabled (but: which version?). To keep things simple, I think we should start with the first approach: rely on a system patchelf (or one loaded by the user and allow-ed)
As an example, I created an EasyBlock & EasyConfig for NCCL:
# nccl.py
import os

import easybuild.tools.environment as env
from easybuild.easyblocks.generic.makecp import MakeCp
from easybuild.framework.easyblock import DEFAULT_BIN_LIB_SUBDIRS
from easybuild.tools.run import run_cmd


class EB_NCCL(MakeCp):
    """Support for building NCCL"""

    # def __init__(self, *args, **kwargs):
    #     """Initialize NCCL specific variables"""

    # TODO:
    # - Filter RPATHS for things like cuda stubs, just like RPATH wrappers
    # - Make running this post_install_step conditional on whether or not --rpath is set
    # - Invoke super's post_install_step
    def post_install_step(self, rpath_dirs=None):
        """Set RPATHs"""
        ld_lib_path = os.getenv('LD_LIBRARY_PATH')
        self.log.debug("LD_LIBRARY_PATH = %s" % ld_lib_path)

        if rpath_dirs is None:
            rpath_dirs = self.cfg['bin_lib_subdirs'] or self.bin_lib_subdirs()

        if not rpath_dirs:
            rpath_dirs = DEFAULT_BIN_LIB_SUBDIRS
            self.log.info("Using default subdirectories for binaries/libraries to verify RPATH linking: %s",
                          rpath_dirs)
        else:
            self.log.info("Using specified subdirectories for binaries/libraries to verify RPATH linking: %s",
                          rpath_dirs)

        for dirpath in [os.path.join(self.installdir, d) for d in rpath_dirs]:
            if os.path.exists(dirpath):
                self.log.debug("Setting RPATH for files in %s", dirpath)

                for path in [os.path.join(dirpath, x) for x in os.listdir(dirpath)]:
                    self.log.debug("Setting RPATH for %s", path)

                    out, ec = run_cmd("file %s" % path, simple=False, trace=False)
                    if ec:
                        # 'file' failed: warn and skip this path rather than inspecting bogus output
                        fail_msg = "Failed to run 'file %s': %s" % (path, out)
                        self.log.warning(fail_msg)
                        continue

                    # only run ldd/readelf on dynamically linked executables/libraries
                    # example output:
                    # ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), ...
                    # ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped
                    if "dynamically linked" in out:
                        self.log.debug("%s is a dynamically linked file. Lets try to set RPATH" % path)
                        out, ec = run_cmd("patchelf --print-rpath %s" % path)
                        self.log.debug("Old RPATH: %s" % out)
                        # TODO: if old RPATH is not empty, print error, or warning? Or append it to the ld_lib_path before setting it?
                        out, ec = run_cmd("patchelf --force-rpath --set-rpath %s %s" % (ld_lib_path, path))
                        # Remove unneeded RPATH entries:
                        out, ec = run_cmd("patchelf --force-rpath --shrink-rpath %s" % path)
                    else:
                        self.log.debug("%s is not dynamically linked, so skipping setting of RPATH", path)
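Two of the TODO items in the EasyBlock above (filtering things like CUDA stubs, and what to do with a non-empty old RPATH) could be handled when composing the RPATH string, before patchelf is called. A rough sketch follows. `build_rpath` is a hypothetical helper, and treating any path with a `stubs` component as something to filter is an assumption, not the rule the RPATH wrappers apply.

```python
def build_rpath(ld_library_path, old_rpath="", filter_components=("stubs",)):
    """Compose an RPATH string from an existing RPATH plus $LD_LIBRARY_PATH.

    Hypothetical helper: keeps the old entries first, drops empty and
    duplicate entries, and skips any directory that has one of
    `filter_components` as a path component (assumed: CUDA stubs dirs).
    """
    entries = []
    candidates = (old_rpath or "").split(":") + (ld_library_path or "").split(":")
    for entry in candidates:
        if not entry or entry in entries:
            continue  # empty (e.g. trailing ':') or already present
        if any(comp in entry.split("/") for comp in filter_components):
            continue  # filtered, e.g. .../lib64/stubs
        entries.append(entry)
    return ":".join(entries)
```

Putting the old RPATH first preserves whatever the vendor set (for example `$ORIGIN`), which is one possible answer to the TODO; printing a warning instead would be equally defensible.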
And as EasyConfig, I took NCCL-2.9.9-CUDA-11.3.0.eb from the repo, moved it to the GCCcore toolchain instead of the system level, and removed the easyblock = MakeCp line so that it now uses my nccl.py EasyBlock.
# NCCL-2.9.9-GCCcore-10.3.0-CUDA-11.3.0.eb
name = 'NCCL'
version = '2.9.9'
local_cuda_version = '11.3.1'
versionsuffix = '-CUDA-%s' % local_cuda_version
homepage = 'https://developer.nvidia.com/nccl'
description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
communication primitives that are performance optimized for NVIDIA GPUs."""
toolchain = {'name': 'GCCcore', 'version': '10.3.0'}
local_cuda_version_major_minor = '.'.join(local_cuda_version.split('.')[:2])
# Download from https://developer.nvidia.com/nccl/nccl-download (after log in)
sources = ['%%(namelower)s_%%(version)s-1+cuda%s_%%(arch)s.txz' % local_cuda_version_major_minor]
checksums = [
    {
        '%%(namelower)s_%%(version)s-1+cuda%s_aarch64.txz' % local_cuda_version_major_minor:
            'b3291328990785807ba01d20a730bed86fbf84d28194da7e8194c9503d0eaefe',
        '%%(namelower)s_%%(version)s-1+cuda%s_ppc64le.txz' % local_cuda_version_major_minor:
            '11dec74b9397c100588dcc96a8c23ffc89567523d47047bcf5d12f625c1c13b6',
        '%%(namelower)s_%%(version)s-1+cuda%s_x86_64.txz' % local_cuda_version_major_minor:
            '59df720e039fd8a765ceb94749e0a8ca9d7bb3dd9bec4867f500e8f0325262c8',
    }
]
dependencies = [('CUDA', local_cuda_version, '', True)]
skipsteps = ['build']
files_to_copy = ['lib', 'include']
sanity_check_paths = {
    'files': ['lib/libnccl.%s' % SHLIB_EXT, 'lib/libnccl_static.a', 'include/nccl.h'],
    'dirs': ['include'],
}
moduleclass = 'lib'
Where this would previously fail in the RPATH sanity check, it now passes. Some sample output from the RPATH sanity check:
== 2021-10-13 17:06:10,674 run.py:233 INFO running cmd: ldd /home/casparl/.local/easybuild/Debian10/2021/software/NCCL/2.9.9-GCCcore-10.3.0-CUDA-11.3.1/lib/libnccl.so.2.9.9
== 2021-10-13 17:06:10,727 run.py:584 DEBUG cmd "ldd /home/casparl/.local/easybuild/Debian10/2021/software/NCCL/2.9.9-GCCcore-10.3.0-CUDA-11.3.1/lib/libnccl.so.2.9.9" exited with exit code 0 and output:
linux-vdso.so.1 (0x00007ffc4afc6000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000014623d60e000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000014623d604000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000014623d5ff000)
libstdc++.so.6 => /sw/arch/Debian10/EB_production/2021/software/GCCcore/10.3.0/lib64/libstdc++.so.6 (0x000014623d418000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014623d293000)
libgcc_s.so.1 => /sw/arch/Debian10/EB_production/2021/software/GCCcore/10.3.0/lib64/libgcc_s.so.1 (0x000014623d279000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014623d0b8000)
/lib64/ld-linux-x86-64.so.2 (0x0000146244f61000)
...
== 2021-10-13 17:06:10,757 run.py:584 DEBUG cmd "readelf -d /home/casparl/.local/easybuild/Debian10/2021/software/NCCL/2.9.9-GCCcore-10.3.0-CUDA-11.3.1/lib/libnccl.so.2.9.9" exited with exit code 0 and output:
Dynamic section at offset 0x7def000 contains 33 entries:
Tag Type Name/Value
0x000000000000000f (RPATH) Library rpath: [/sw/arch/Debian10/EB_production/2021/software/GCCcore/10.3.0/lib64]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
0x000000000000000e (SONAME) Library soname: [libnccl.so.2]
...
Just as I would expect: it now picks up the libstdc++ and libgcc_s from GCCcore/10.3.0.
Note that a build against the system toolchain would still fail, since the sanity check explicitly checks for an RPATH entry in the ELF header. Thus, a binary that only requires system libs (such as an NCCL installed at system level) would still fail the current RPATH sanity check. We should probably decide that in some cases, a missing RPATH entry in the ELF header is not a reason to fail the sanity check as long as ldd shows no missing libraries, but I'd need to think about the scenarios where that makes sense.
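To make the idea of a relaxed check concrete, here is a rough sketch that decides on the outputs of ldd and readelf -d, in the form shown in the log above. `rpath_check_ok` and its `require_rpath` switch are hypothetical names, not part of the current sanity check, and when the relaxed mode should apply is exactly the open question.

```python
def rpath_check_ok(ldd_out, readelf_out, require_rpath=False):
    """Return (ok, message) for one binary/library.

    Hypothetical helper: a file always fails if ldd reports a missing
    library; a missing RPATH entry is only a failure if `require_rpath`
    is set.
    """
    # ldd prints lines like "libfoo.so.1 => not found" for unresolved libraries
    missing = [line.split()[0] for line in ldd_out.splitlines() if "not found" in line]
    if missing:
        return False, "missing libraries: %s" % ", ".join(missing)
    # readelf -d prints "(RPATH)" in the Type column when an RPATH entry exists
    if require_rpath and "(RPATH)" not in readelf_out:
        return False, "no RPATH entry in dynamic section"
    return True, ""
```

With `require_rpath=False`, a binary that only needs system libraries passes even without an RPATH entry; with `require_rpath=True` the behaviour matches the stricter check described above.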
Another thing is that I now put this logic in an EasyBlock. It would be much nicer if we can separate that into a single function, so that one would only have to include something like
def post_install_step(self, rpath_dirs=None):
    some_generic_functions_setting_rpath(rpath_dirs)
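Such a generic function might look roughly like the standalone sketch below. It is deliberately independent of EasyBuild: `set_rpaths` is a hypothetical name, it uses `subprocess` and `shutil.which` from the standard library instead of `run_cmd`, and the `run` and `which` parameters exist only so that the logic can be exercised without patchelf installed.

```python
import os
import shutil
import subprocess


def set_rpaths(installdir, subdirs, rpath, run=subprocess.run, which=shutil.which):
    """Set `rpath` on dynamically linked files directly under the given subdirs.

    Hypothetical sketch; returns the list of paths that were patched.
    """
    if which("patchelf") is None:
        raise RuntimeError("patchelf not found in $PATH; load a module that provides it")

    patched = []
    for dirpath in (os.path.join(installdir, d) for d in subdirs):
        if not os.path.isdir(dirpath):
            continue  # e.g. no 'bin' in a library-only install
        for name in sorted(os.listdir(dirpath)):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            # same heuristic as the EasyBlock above: ask 'file' about linkage
            res = run(["file", path], capture_output=True, text=True)
            if res.returncode != 0 or "dynamically linked" not in res.stdout:
                continue
            run(["patchelf", "--force-rpath", "--set-rpath", rpath, path], check=True)
            run(["patchelf", "--force-rpath", "--shrink-rpath", path], check=True)
            patched.append(path)
    return patched
```

An EasyBlock would then only need to pass its install directory, the subdirectories to scan and the RPATH to set; whether this belongs in the framework or in a generic EasyBlock is a separate design decision.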
The Issue
When building binaries and shared libraries with EasyBuild with RPATH support enabled (--rpath), the EasyBuild RPATH wrappers inject an RPATH into the binary/shared library header. However, for binary installations, nothing is currently done to support the setting of RPATH.
The way this is currently 'handled' in EasyBuild is that for example the Binary EasyBlock just skips the RPATH sanity check. That means the installation will complete successfully, but if the site has EasyBuild configured to filter LD_LIBRARY_PATH from the modulefiles (a fair approach when you set RPATH), your binary won't work. Plus, it's not really the behaviour you might expect: you configured EasyBuild with RPATH support, yet you get binaries that aren't RPATH-ed.
This issue is also problematic for projects such as EESSI and (I guess) the Compute Canada software stacks, that rely heavily on RPATH in order to pick up libraries from the right place - and not from the OS.
Right now, at our site, we sometimes explicitly include patchelf as build dependency, and execute a patchelf command using postinstallcmds to inject the RPATH (see e.g. https://github.com/easybuilders/easybuild-easyconfigs/pull/13716 ). However, this is bad in two ways: it means you customize this at the level of the EasyConfig, which means sites with/without RPATH support would need different EasyConfigs. That doesn't make sense. It should be done at a higher level, e.g. the easyblock should check whether RPATH support is enabled, and if so, inject it.
Proposed solution
I propose we implement some mechanism for EasyBuild to set the
RPATH in a generic way for binary installations. As discussed with @boegel, the following approach probably makes most sense for that:
- [...] easybuild-framework that can be called to set these RPATHs
- [...] --rpath is enabled) from EasyBlocks that do binary installations (i.e. Binary, and maybe more specific EasyBlocks such as MotionCor2).