NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

python311Packages.jaxlib: shouldn't force symlinkJoins into RUNPATH #323619

Open SomeoneSerge opened 3 months ago

SomeoneSerge commented 3 months ago

SymlinkJoins in the context of cuda packages are an ugly hack used at configuration time to accommodate build systems that don't support split outputs ("splayed layouts"). We allow these as a compromise at build time, but we do not want to keep references to the symlink farms in the runtime closures: they are extremely expensive to store, to say nothing of other things.

https://github.com/NixOS/nixpkgs/blob/e6cdd8a11b26b4d60593733106042141756b54a3/pkgs/development/python-modules/jaxlib/default.nix#L436-L444

CC @samuela

samuela commented 3 months ago

> they are extremely expensive to store, to say nothing of other things.

why are they expensive to store? isn't it just a hierarchy of lightweight symlinks?

in any case, i'm all for cleaning this stuff up

SomeoneSerge commented 3 months ago

> why are they expensive to store? isn't it just a hierarchy of lightweight symlinks?

Because they reference potentially unused dependencies, preventing their garbage collection and pulling them into images. If there was a static archive in the inputs, it will always be pulled in by the outputs.

samuela commented 3 months ago

> Because they reference potentially unused dependencies, preventing their garbage collection and pulling them into images.

But doesn't that apply to any use of split cudaPackages? Why is symlinkJoin at fault?

SomeoneSerge commented 3 months ago

> But doesn't that apply to any use of split cudaPackages? Why is symlinkJoin at fault?

So you're building an output, e.g. torch. The only way in which torch will usually retain a reference to an input from cudaPackages is through a DSO, e.g. libcudart.so or libcublas.so, because it links them via DT_RUNPATH. There's no reason for torch to "reference" any other piece of cuda. With the symlinkJoin, libcublas.so and libcublas.a appear to live in the same directory, the same store path, because we'll reference the symlinks instead of their targets. With the current approach, libcublas.so is only visible from libcublas.lib and libcublas.a from libcublas.static.
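The closure argument above can be illustrated with a toy model. This is only a sketch: the store path names (`cuda-joined`, `libcublas-lib`, `libcublas-static`) and the reference edges are hypothetical stand-ins, not real nixpkgs outputs, and the traversal is a plain graph walk rather than Nix's actual reference scanner.

```python
# Toy model of runtime closures. All store path names and reference edges
# below are hypothetical stand-ins, not real nixpkgs outputs.

def closure(root, references):
    """Return every store path reachable from `root` by following references."""
    seen = set()
    stack = [root]
    while stack:
        path = stack.pop()
        if path in seen:
            continue
        seen.add(path)
        stack.extend(references.get(path, ()))
    return seen

# Split outputs: the DSO and the static archive live in different store paths.
split_outputs = {
    "libcublas-lib": [],     # holds libcublas.so
    "libcublas-static": [],  # holds libcublas.a
}

# Case 1: RUNPATH points at the symlink farm, which links into every joined output.
with_join = {
    **split_outputs,
    "cuda-joined": ["libcublas-lib", "libcublas-static"],
    "torch": ["cuda-joined"],
}

# Case 2: RUNPATH points directly at the output that holds the DSO.
without_join = {
    **split_outputs,
    "torch": ["libcublas-lib"],
}

print(sorted(closure("torch", with_join)))
print(sorted(closure("torch", without_join)))
```

In this model the first closure contains the static archive's store path even though nothing loads it at runtime, while the second contains only the output with the shared library.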

Similarly, the stub driver with the previous approach is visible from cuda_cudart.out (as a symlink) and from cuda_cudart.stubs. And we end up putting the one from .out into LD_LIBRARY_PATH, which takes priority over RUNPATH's /run/opengl-driver/lib. Now the stub library is only visible from .stubs (and from ${getLib cuda_cudart}/lib/stubs/libcuda.so).

EDIT: oh I forgot which issue/PR I'm replying in; "now" refers to "after we've switched to propagatedBuildInputs instead of symlinkJoin"
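The LD_LIBRARY_PATH versus RUNPATH point in the comment above can be sketched with a simplified model of the lookup order for a binary that uses DT_RUNPATH: directories from LD_LIBRARY_PATH are tried before those from RUNPATH. The `resolve` helper, the placeholder directory `<cuda_cudart.out>/lib`, and the directory contents are assumptions made for illustration; the real loader also consults its cache and default directories, which are left out here.

```python
# Simplified sketch of the library search order for a DT_RUNPATH binary:
# LD_LIBRARY_PATH entries first, then RUNPATH entries.
# `resolve`, STUB_DIR and the directory contents are hypothetical.

def resolve(soname, ld_library_path, runpath, filesystem):
    """Return the first directory, in search order, that contains `soname`."""
    for directory in [*ld_library_path, *runpath]:
        if soname in filesystem.get(directory, ()):
            return directory
    return None

STUB_DIR = "<cuda_cudart.out>/lib"    # placeholder, not a real store path
DRIVER_DIR = "/run/opengl-driver/lib"

filesystem = {
    STUB_DIR: {"libcuda.so"},    # stub driver, visible through a symlink
    DRIVER_DIR: {"libcuda.so"},  # the real driver
}

# Stub directory on LD_LIBRARY_PATH: it shadows the driver found via RUNPATH.
before = resolve("libcuda.so", [STUB_DIR], [DRIVER_DIR], filesystem)

# Stub directory no longer on LD_LIBRARY_PATH: RUNPATH finds the real driver.
after = resolve("libcuda.so", [], [DRIVER_DIR], filesystem)

print(before)
print(after)
```

In the first call the stub wins because its directory is searched first; in the second the lookup falls through to /run/opengl-driver/lib.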