Fix for CUDA Toolkit packages containing incorrect RPATH

jakirkham commented 11 months ago

Introduction

We recently became aware of an issue in the cuda-nvtx-feedstock where the RPATHs of the libraries in the package were incorrect ( https://github.com/conda-forge/cuda-nvtx-feedstock/issues/2 ). These incorrect RPATHs are a result of the directory layout used for CUDA packages. All distributions of CUDA place their contents in a top-level targets directory, with subdirectories for different architectures to better support cross-compilation. The CUDA packages on conda-forge mimic this structure, but to support standard runtime library use cases, the library contents of CUDA packages are also symlinked into the top-level lib directory. The problem lies in how $ORIGIN is handled for symlinks: the RPATHs are set at build time relative to the true library location, but at runtime $ORIGIN expands to the location of the symlink rather than that of the real library. As a result, the RPATHs cause library searches outside of the environment.

We would like to maintain the targets layout because it matches how CUDA is provided in other distributions. This also means we want to keep the real libraries in the targets directory rather than placing them directly in lib. We would also like to avoid ballooning the package size or adding any RPATHs that point outside the environment, since such RPATHs are broken at best and dangerous at worst. To satisfy all of these constraints, our proposed solution is to use patchelf during the conda package build step to manually set the RPATH to $ORIGIN on all the libraries in the targets directory. At runtime, when a library is loaded through its symlink in $PREFIX/lib, an RPATH of $ORIGIN resolves to $PREFIX/lib, producing the desired behavior. There are some potential caveats to how this works within the context of conda-build, as we discuss below, but we have verified that this produces the desired runtime results.

Problem Statement

The CTK packages are structured to have a runtime package and a -dev package.
- Example: cuda-nvtx-dev & cuda-nvtx. The runtime package, cuda-nvtx, contains the libraries.
- The -dev package has a dependency on the runtime package so that these libraries are available at build time.
Library files are in paths like $PREFIX/targets/<arch>/lib/*.so*.
- These are used to link against at build time.
- This is the preferred location, because it communicates that we have a cross-compiler-friendly library location and matches CUDA packages in other distribution forms.
Library files are symlinked into $PREFIX/lib/*.so*
Conda-build detects the .so files in the deeper folder, $PREFIX/targets/<arch>/lib.
- It sets the RPATH to be $ORIGIN/../../../lib.
At runtime, the symlink is found in $PREFIX/lib.
1. The library at $PREFIX/targets/<arch>/lib is loaded.
2. Its RPATH is $ORIGIN/../../../lib
3. $ORIGIN is considered to be $PREFIX/lib
4. Due to this, the library search path goes outside of the environment.

This can still result in a functioning environment if either of the following cases holds.

All of these are true:

- The environment is not the base environment
- The environment is contained within the base environment’s envs folder
- The base environment contains compatible libraries

Or:

- The compatible libraries are accessible via LD_LIBRARY_PATH or the standard ld.so search paths.

If neither of those cases is met, the environment will not be functional.
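To make the path arithmetic above concrete, here is a minimal sketch of how the RPATH expands in the two situations. The prefix, the `<arch>` directory name, and the library name are hypothetical placeholders, and the helper only does lexical path normalization; it illustrates the behavior described above and is not a model of the dynamic loader itself.

```python
import posixpath


def expand_rpath(rpath: str, load_path: str) -> str:
    """Expand $ORIGIN to the directory of the path the library was opened through.

    Assumption (as described in this issue): $ORIGIN follows the path used to
    load the library, so a symlink in lib yields lib, not the real directory.
    """
    origin = posixpath.dirname(load_path)
    # normpath collapses ".." lexically; good enough for illustration.
    return posixpath.normpath(rpath.replace("$ORIGIN", origin))


# Hypothetical environment living in the base environment's envs folder.
prefix = "/opt/conda/envs/myenv"
arch = "x86_64-linux"  # hypothetical targets subdirectory
rpath = "$ORIGIN/../../../lib"  # what conda-build sets for the real file

real_path = f"{prefix}/targets/{arch}/lib/libexample.so.1"
symlink_path = f"{prefix}/lib/libexample.so.1"

# Build-time view: relative to the real file, the RPATH lands in $PREFIX/lib.
print(expand_rpath(rpath, real_path))     # /opt/conda/envs/myenv/lib

# Runtime view: relative to the symlink, the RPATH escapes the environment.
print(expand_rpath(rpath, symlink_path))  # /opt/conda/lib
```

In this hypothetical layout the escaped path happens to be the base environment's lib directory, which is why an environment inside the base environment's envs folder can appear to work when base has compatible libraries.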

Our Solution

Keep existing file locations and symlink direction
- Actual library files in targets/…/lib
- Symlink in lib for each CUDA library that points to the library in ../targets/<arch>/…
Use patchelf to set the RPATH to $ORIGIN for the libraries matching $PREFIX/targets/<arch>/lib/*.so*
Set build: binary_relocation: false so that conda-build doesn’t otherwise change the RPATHs of these libraries
At runtime, loading libraries from their symlinks in $PREFIX/lib will look for libraries adjacent to the symlink in $PREFIX/lib.
- This is key to the NVIDIA libraries loading Conda’s libstdc++ instead of the system libstdc++.
- This relies on the assumption that libraries are never loaded at runtime from the targets/…/lib folder.
- The only functional runtime approach is to load them from $PREFIX/lib.
Conda-build’s missing DSO detection needs to be disabled by setting error_overlinking to false
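As a rough sketch of the patchelf step, something like the following could run during the package build. The function names and the idea of driving patchelf from Python are assumptions made for illustration (the feedstocks may well do this in a shell build script); the only patchelf option used is `--set-rpath`. Symlinks are skipped so that each real library file is patched once.

```python
import subprocess
from pathlib import Path


def patchelf_commands(lib_dir: Path) -> list[list[str]]:
    """Build one patchelf invocation per real shared library in lib_dir.

    Hypothetical helper: lib_dir is expected to be the targets/<arch>/lib
    directory inside the build prefix.
    """
    commands = []
    for path in sorted(lib_dir.glob("*.so*")):
        # Skip version symlinks; only the real file carries the RPATH.
        if path.is_symlink() or not path.is_file():
            continue
        # "$ORIGIN" is passed literally: no shell is involved, so the
        # dynamic loader (not the shell) expands it at runtime.
        commands.append(["patchelf", "--set-rpath", "$ORIGIN", str(path)])
    return commands


def apply_rpath_fix(lib_dir: Path) -> None:
    """Run the patchelf commands; requires patchelf in the build environment."""
    for cmd in patchelf_commands(lib_dir):
        subprocess.run(cmd, check=True)
```

Separating command construction from execution keeps the sketch easy to inspect: printing the result of `patchelf_commands(...)` shows exactly which files would be touched before anything is modified.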

Justification

This approach aligns more closely with how the CUDA Toolkit is distributed outside of conda than the alternatives we considered below. It also avoids unnecessarily bloating the package.

Considered Alternatives

Reverse symlink direction

This was originally proposed by @isuruf in https://github.com/conda-forge/cuda-nvtx-feedstock/pull/3
Instead of having the library files reside in targets/…/lib, the actual library file would be placed in $PREFIX/lib, and the symlink would be created in the targets/…/lib folder
Conda-build will detect the library’s location, and set RPATH to $ORIGIN/../lib
Loading the library from the targets/…/lib folder is broken, as it is in the proposed solution; the library can only be loaded correctly from the $PREFIX/lib location.
Conda-build’s missing DSO detection should work correctly under this scheme.

Comments

This approach would result in a different CUDA Toolkit layout in Conda compared to other distributions. Alignment across CUDA Toolkit distributions is important for libraries using CUDA to have similar expectations and behaviors both inside and outside of conda environments.

Duplicating library in both locations

Instead of symlinking, the library files would be contained in both the *-dev and the runtime packages. They would exist in the targets/…/lib location in the *-dev package, and in $PREFIX/lib in the runtime package.
Conda-build would detect and correctly set RPATH in both instances.
- The library in the *-dev package would have an RPATH of $ORIGIN/../../../lib, which evaluates to $PREFIX/lib. Loading of sibling libraries in the targets folder would rely on fallback to RUNPATH, which is $ORIGIN.
- The library in the runtime package would have an RPATH of $ORIGIN/../lib, which again evaluates to $PREFIX/lib. Sibling libraries are present in this same folder in this package, so the fallback to RUNPATH doesn’t come into play.
Conda-build’s missing DSO detection should work correctly under this scheme.

Comments

The cuda metapackage makes the assumption that both build-time and run-time components are provided. Because this approach duplicates libraries between the -dev and runtime packages, the effective size of the cuda metapackage would be roughly doubled. This is prohibitive. Additionally, having -dev and runtime variants of a metapackage is not favorable, because it would differ from other ways of distributing CUDA.

jakirkham commented 11 months ago

All fixes have been merged. Closing as completed.

jakirkham commented 3 months ago

Reopening to look at bin where it appears similar work may be needed

jakirkham commented 2 months ago

cc @billysuh7 (to look at doing the same thing for binaries in a couple weeks)

jakirkham commented 3 weeks ago

Billy looked through the feedstocks and found the following ones still need RPATH fixes:

conda-forge / cuda-feedstock