eomii / rules_ll

An Upstream Clang/LLVM-based toolchain for contemporary C++ and heterogeneous programming
https://ll.eomii.org

Migrate CUDA imports to new variants in nixpkgs #61

Open · aaronmondal opened this issue 1 year ago

aaronmondal commented 1 year ago

https://github.com/NixOS/nixpkgs/issues/224646#issuecomment-1498945232 mentioned that the way we currently import CUDA from nix is outdated. We should change imports from the outdated

pkgs.cudaPackages.cudatoolkit

to

cudaPackages.{lib,cuda_foo}
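
For example, a devShell could look roughly like the sketch below. The attribute names (cuda_nvcc, cuda_cudart, cuda_cccl, libcublas) are taken from the split cudaPackages set in nixpkgs and are only illustrative; we'd trim this to whatever we actually need:

```nix
# Sketch of the new-style imports, assuming a recent nixpkgs with the split
# cudaPackages set. Pull in individual components instead of the monolithic
# cudatoolkit.
{ pkgs, ... }:
pkgs.mkShell {
  packages = with pkgs.cudaPackages; [
    cuda_nvcc    # compiler driver
    cuda_cudart  # CUDA runtime headers and libraries
    cuda_cccl    # CUB/Thrust/libcudacxx headers
    libcublas    # only if we actually link against cuBLAS
  ];
}
```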

@JannisFengler @SpamDoodler This might make WSL compatibility work.

SomeoneSerge commented 1 year ago

I'm not familiar with WSL, but from a brief search they seem to be deploying their libcuda.so in /usr/lib/wsl/lib:

Note that if there are libraries in /usr/lib/wsl/lib other than libcuda.so, libnvidia-ml.so, etc. (the ones that NixOS deploys impurely), using LD_LIBRARY_PATH might result in conflicts, because /usr/lib/wsl/lib then takes priority over the dependencies recorded in the Runpaths by Nix.

aaronmondal commented 1 year ago

> Note that if there are libraries in /usr/lib/wsl/lib other than libcuda.so, libnvidia-ml.so, etc. (the ones that NixOS deploys impurely), using LD_LIBRARY_PATH might result in conflicts, because /usr/lib/wsl/lib then takes priority over the dependencies recorded in the Runpaths by Nix.

We've encountered this before in https://github.com/eomii/rules_ll/issues/21. ATM we're advising against using /usr/...-style paths in the external dependency guide. Maybe symlinking the cuda-related paths to another directory and setting that via ldconfig would make sense.

I'd like to avoid it, but it might be necessary to check for the existence of WSL in our flake setup and explicitly set -l:libsomething.so and corresponding rpaths only for the *_nvptx toolchains.

Another option would be to symlink only the impure libraries we actually need into another directory, which we can then add to the search paths via the LL_CUDA_* flags.

None of these options seem optimal to me though.

SomeoneSerge commented 1 year ago

Yes, you'd only need to symlink the libraries that are version-locked to the driver: libcuda.so, libnvidia-ml.so... Adding that location to LD_LIBRARY_PATH should work, in principle. I haven't heard of LL_CUDA... before, is it WSL-specific?
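
A minimal sketch of that idea as a devShell hook, assuming a hypothetical .cuda-driver-libs directory (not something rules_ll currently does):

```nix
# Sketch only: symlink just the driver-locked libraries out of /usr/lib/wsl/lib
# into a dedicated directory, so LD_LIBRARY_PATH never exposes anything else
# from that impure location.
{ pkgs, ... }:
pkgs.mkShell {
  shellHook = ''
    if [ -d /usr/lib/wsl/lib ]; then
      driver_libs="$PWD/.cuda-driver-libs"   # hypothetical location
      mkdir -p "$driver_libs"
      ln -sf /usr/lib/wsl/lib/libcuda.so* "$driver_libs/"
      ln -sf /usr/lib/wsl/lib/libnvidia-ml.so* "$driver_libs/"
      export LD_LIBRARY_PATH="$driver_libs''${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    fi
  '';
}
```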

aaronmondal commented 1 year ago

Ahh, sorry for the confusion, no, that's just our rules_ll-specific way of getting Nix deps into Bazel builds. We use something like this in our flake:

https://github.com/eomii/rules_ll/blob/1354042eb87baa2437877bc1fe6e05cb84a605eb/flake.nix#L110-L115

Which is then fed to the compilation sandboxes in Bazel:

https://github.com/eomii/rules_ll/blob/1354042eb87baa2437877bc1fe6e05cb84a605eb/flake.nix#L121-L142

Then it's consumed by some compilation paths:

https://github.com/eomii/rules_ll/blob/1354042eb87baa2437877bc1fe6e05cb84a605eb/ll/args.bzl#L229-L239

And some link actions:

https://github.com/eomii/rules_ll/blob/1354042eb87baa2437877bc1fe6e05cb84a605eb/ll/args.bzl#L492-L509

Bazel doesn't track dependencies outside of its build graph. We had explicit Bazel-only imports before that mapped out all the files, but that felt too hacky and fragile to maintain. So at some point we decided to kick out that logic and just import everything from the much easier to manage Nix env 😄
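
For context, the Nix side of that plumbing looks roughly like the sketch below. LL_CUDA_TOOLKIT and LL_CUDA_CCCL are placeholder names, not the actual variables; the real definitions are in the flake.nix lines linked above:

```nix
# Sketch only: expose the store paths of the CUDA packages as environment
# variables so the Bazel actions (see args.bzl above) can turn them into
# include paths, link search paths, and rpaths. mkShell forwards unknown
# attributes as environment variables in the dev shell.
{ pkgs, ... }:
pkgs.mkShell {
  LL_CUDA_TOOLKIT = "${pkgs.cudaPackages.cuda_cudart}";  # placeholder name
  LL_CUDA_CCCL = "${pkgs.cudaPackages.cuda_cccl}";       # placeholder name
}
```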

We're actually doing the opposite for ROCm: there we have access to the source code, so we ported the ROCm/HIP build, which lets us build everything with our own C++-only toolchains for later consumption by the *_amdgpu toolchains.