[proposal] split up `cudatoolkit` package into its constituent pieces

samuela commented 2 years ago

Issue description

Right now cudatoolkit is a truly behemoth package with just about every possible CUDA tool under the sun. It is packaged by downloading the .run-file based installer, running it, and then copying out the results. This presents a few challenges:

The corresponding nar file is >2GB which breaks cache.nixos.org.
It is sometimes unclear exactly which tools are included, and finding out requires changing things.

Proposal

Create separate packages for each of the tools, e.g. cudatoolkit_11_6_cupti, cudatoolkit_11_6_nvcc, etc.

@kmittman from Nvidia kindly pointed me to https://developer.download.nvidia.com/compute/cuda/redist/ which would make packaging these pieces individually much easier.

TODO

This issue is intended to be a proposal for discussion/brainstorming how to best proceed. Some open questions in my mind:

[ ] How to best migrate existing packages onto the new format? Perhaps we could have a cudatoolkit_11_6 package that just combines a bunch of small ones to emulate the old behavior?
[ ] How to package cuDNN? It doesn't seem to be included in /redist/ anywhere.
[ ] /redist/ only includes releases for 11.0-11.6.1... but we should prob get rid of CUDA 10.x anyways.
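
As a rough sketch of the first open question, a combined package could be assembled with `symlinkJoin`, an existing nixpkgs builder that links several store paths into one tree. The per-tool names below are taken from the proposal above and are assumptions; they do not exist in nixpkgs as written, and the `name` is a placeholder.

```nix
# Sketch only: cudatoolkit_11_6_nvcc and cudatoolkit_11_6_cupti are the
# hypothetical per-tool packages named in the proposal, not real attributes.
{ symlinkJoin
, cudatoolkit_11_6_nvcc
, cudatoolkit_11_6_cupti
}:

symlinkJoin {
  # Placeholder name for the compatibility package.
  name = "cudatoolkit-11.6-combined";
  # Every listed package is symlinked into a single output tree,
  # approximating the layout of the old monolithic package.
  paths = [
    cudatoolkit_11_6_nvcc
    cudatoolkit_11_6_cupti
  ];
}
```

Packages that have not been migrated could keep depending on such a combined output, while migrated ones would pull in only the pieces they need.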

cc @NixOS/cuda-maintainers @knedlsepp

FRidh commented 2 years ago

How to best migrate existing packages onto the new format? Perhaps we could have a cudatoolkit_11_6 package that just combines a bunch of small ones to emulate the old behavior?

We can try to generate a combined package but I think we can just keep the classic behemoth for the packages not yet migrated.

How to package cuDNN? It doesn't seem to be included in /redist/ anywhere.

What is the issue with the current cudnn packages we fetch? Are those too big as well?

samuela commented 2 years ago

We can try to generate a combined package but I think we can just keep the classic behemoth for the packages not yet migrated.

Yeah, that's fine as well. Just so long as we have a way to migrate people over.

What is the issue with the current cudnn packages we fetch? Are those too big as well?

It is rather large and unwieldy. I just realized though that it's already packaged in a somewhat sensible way (expanding and patching a tgz file). See e.g. https://cs.github.com/NixOS/nixpkgs/blob/a8f938c15c84df4bef8e920fac71cd876188fa9e/pkgs/development/libraries/science/math/cudnn/generic.nix.

It would be really nice if we could make cudnn, cutensor, etc be sub-packages of cudatoolkitPackages. That way you'd always get a consistent package combo without having to go through the fuss of using things like "cudnn_8_1_cudatoolkit_10_2".

SomeoneSerge commented 2 years ago

By the way, I feel like I've been linking too often to this comment as a reference for the cudaPackages approach to ensuring consistency of cuda-cudnn versions between packages. Maybe it deserves a separate proposal issue.

samuela commented 2 years ago

Mmm yeah that's not a bad idea... Do you have a workaround in mind that could resolve cudnn/cudatoolkit mismatches?

SomeoneSerge commented 2 years ago

What @FRidh describes in the second part of https://github.com/NixOS/nixpkgs/pull/166784#issuecomment-1086667289. I understand it this way:

# all-packages.nix

# remove the old cudaPackages (old different semantics: the result contains different versions of cuda)
# remove cutensorPackages
# remove cudnnPackages
# new semantics: the result contains a single version of cuda, a single version of cudnn, a single version of cutensor, all mutually compatible
cudaPackages = callPackage .../cuda-packages.nix { cudaVersion = "11.4"; cudnnVersion = "8.3"; cutensorVersion = "1.3.1.3"; };

And either

# .../pytorch/default.nix

{ ...
, cudaPackages
}: buildPythonPackage {
  # ...
  nativeBuildInputs = [
    # ...
    cudaPackages.nvcc
  ];
  buildInputs = [
    # ...
    cudaPackages.cudatoolkit
    cudaPackages.cudnn
  ];
}

# .../overlay.nix
final: prev: {
  # change default cudaPackages and thus rebuild pkgs.pytorch
  cudaPackages = prev.cudaPackages.override { ... };
  # or just make a custom build of pytorch
  myPytorch = prev.pytorch.override { cudaPackages = ...; };
}

Or .../pytorch/default.nix stays unchanged and

# python-packages.nix

pytorch = callPackageWith (pkgs // pkgs.cudaPackages) .../pytorch/default.nix { };

In the former case the pytorch derivation gets more inputs (added complexity); in the latter, the overlay user suffers a bit.

SomeoneSerge commented 2 years ago

Now that I've posted this, I see it addresses the problem only partially and doesn't really eliminate the need for assertions.
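
To illustrate the kind of assertion that would still be needed, here is a minimal sketch. It assumes `cudaPackages.cudatoolkit` exposes a `version` attribute, and the "CUDA 11.x only" rule is made up for the example; `lib.versions.major` and `lib.assertMsg` are existing nixpkgs library functions.

```nix
# Sketch only: the accepted major version is an assumption for illustration,
# not a rule taken from any real package.
{ lib, cudaPackages }:

let
  # Extract the major component, e.g. "11" from "11.4".
  cudaMajor = lib.versions.major cudaPackages.cudatoolkit.version;
in
# Fail evaluation with a readable message if the selected CUDA is unsupported.
assert lib.assertMsg (cudaMajor == "11")
  "expected CUDA 11.x, got ${cudaPackages.cudatoolkit.version}";
cudaPackages.cudnn
```

Even with a single consistent cudaPackages set, a package may only build against some CUDA versions, so a check like this would stay in the package expression.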

FRidh commented 2 years ago

Draft in https://github.com/NixOS/nixpkgs/pull/167016.

SomeoneSerge commented 2 years ago

Just to keep track of this. This:

The corresponding nar file is >2GB which breaks cache.nixos.org (https://github.com/NixOS/nixos-org-configurations/issues/207).

...is orthogonal to the current PR and will need to be addressed later.

Current status (./. refers to the checked out PR):

❯ nix path-info --impure --expr '(import <nixpkgs-unstable> { config.allowUnfree = true; }).cudatoolkit_11_5' -hs
querying info about missing paths
/nix/store/qcf89ad9lgaipyy97mn9fdcimx40zn5g-cudatoolkit-11.5.0    4.0G
nixpkgs on  cudatoolkit-redist [$] via ❄️  impure (nix-shell)
❯ nix path-info --impure --expr '(import ./. { config.allowUnfree = true; }).cudatoolkit' -hs
querying info about missing paths
/nix/store/4q5swpzp1qxbid4p02ksxlhi903ng0hv-cudatoolkit-11.5.0    4.0G

samuela commented 2 years ago

The corresponding nar file is >2GB which breaks cache.nixos.org (https://github.com/NixOS/nixos-org-configurations/issues/207).

Once we get everyone off of cudaPackages.cudatoolkit and switched over to using the redist packages that issue ought to be resolved IIUC.

samuela commented 2 years ago

Done in https://github.com/NixOS/nixpkgs/pull/167016.
