NixOS / nixpkgs

Change default nixpkgs' cudaCapabilities #221564

Open SomeoneSerge opened 1 year ago

SomeoneSerge commented 1 year ago

Subgoals:

  1. cuda-maintainers.cachix.org caches nixpkgs imported without a custom config.cudaCapabilities
  2. cuda-maintainers.cachix.org caches master at a higher frequency, but with the shortest cudaCapabilities list possible
  3. Builds are "reasonably cheap"
  4. CUDA maintainers use the same short cudaCapabilities list when working on master?
  5. Users are unlikely to accidentally use packages built with the wrong capabilities, e.g. A100 users don't end up using a 7.5+PTX build
  6. Users who, after the change, need to override cudaCapabilities do not fail to discover config.cudaCapabilities

Steps:

  1. [ ] Who uses CUDA in nixpkgs and how? A NixOS Discourse post with a multiple choice poll, listing possible combinations of cudaCapabilities.
    1. [ ] Ask about MKL and alternative MPI implementations
    2. [ ] Link to it from https://matrix.to/#/#datascience:nixos.org
    3. [ ] Reference https://github.com/Nix-QChem/NixOS-QChem users
    4. [ ] Reference https://github.com/numtide/nixpkgs-unfree
    5. [ ] If there's a new nixpkgs-unfree instance from Domen Kozar, reference that
    6. [ ] Link https://discourse.nixos.org/t/improving-nixos-data-science-infrastructure-ci-for-mkl-cuda/5074/11
    7. [ ] Link https://discourse.nixos.org/t/why-is-the-nix-compiled-python-slower/18717/20?u=sergek
    8. [ ] Link https://discourse.nixos.org/t/petition-to-build-and-cache-unfree-packages-on-cache-nixos-org/17440
ConnorBaker commented 1 year ago

Potentially related: https://github.com/NixOS/nixpkgs/pull/220366#discussion_r1135048161

SomeoneSerge commented 1 year ago

I think we want to default to the newest available capability. Users who need lower capabilities will encounter an error (rather than a cache miss; thinking of cachix), admittedly after a substantial download. The error is also likely to look confusing, but hopefully people will discover the change through GitHub and Discourse. We'd document the change and suggest they import nixpkgs with the capability they need. We'd build caches for all capabilities separately
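
For illustration, a minimal sketch of the kind of import the docs would suggest (the capability value and the extra config flags here are just examples):

```nix
# Hedged sketch: import nixpkgs with an explicit capability list instead of
# relying on the default. "8.6" is only an example (Ampere consumer GPUs).
import <nixpkgs> {
  config = {
    allowUnfree = true;        # CUDA packages are unfree
    cudaSupport = true;
    cudaCapabilities = [ "8.6" ];
  };
}
```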

CC @ConnorBaker @samuela

ConnorBaker commented 1 year ago

The newest as of CUDA 11.8 is Hopper (9.0).

The way it's set up currently, we should be able to do that by taking the last element of the supported capabilities list, or something similar, right?
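
Something like this sketch, assuming the list is kept sorted ascending (both the ordering and the attribute path are assumptions about the current layout):

```nix
# Hedged sketch: pick the newest capability by taking the last element.
# Assumes cudaFlags.cudaCapabilities is sorted ascending; the attribute
# path reflects the current layout and may change.
{ lib, cudaPackages }:
lib.last cudaPackages.cudaFlags.cudaCapabilities   # e.g. "9.0" on CUDA 11.8
```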

EDIT: Should we really choose a default that causes breakages? For example, the current version of torch won't work with anything newer than 8.6 (IIRC). Although, if packages were aware of the capabilities being requested and picked the newest architecture they support instead of just erroring, I guess that would be easier?

SomeoneSerge commented 1 year ago

current version of torch won't work with anything newer than 8.6

o_0 why?

ConnorBaker commented 1 year ago

o_0 why?

Oh, you know, just your standard hardcoded support 🙃

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1751-L1753

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1791-L1793

EDIT: Two paths I can think of:

  1. Patch it out (and whatever else is touched by that) so we can build for arbitrary architectures.
  2. Have cudaPackages.cudaFlags provide a function which, given the list of capabilities a package supports, selects the one closest to the requested target architecture (a sketch follows below).
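
A hypothetical sketch of what such a helper could look like; the function name and its placement are made up for illustration:

```nix
# Hypothetical helper for path 2 (nothing like this exists in cudaFlags yet):
# given the capabilities a package supports, return the newest one that does
# not exceed the requested target capability, or null if none qualifies.
{ lib }:
{
  closestSupportedCapability = target: supported:
    let
      # Capability strings like "8.6" compare correctly with versionOlder.
      candidates = builtins.filter
        (cap: !(lib.versionOlder target cap))   # keep caps <= target
        supported;
    in
      if candidates == [ ] then null
      else lib.last (lib.sort lib.versionOlder candidates);
}

# Example: closestSupportedCapability "9.0" [ "7.5" "8.0" "8.6" ]  =>  "8.6"
```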

Interested in your thoughts!

SomeoneSerge commented 1 year ago

Patch it out (and whatever else is touched by that) so we can build for arbitrary architectures

This is more or less our default approach. Clearly, pytorch may not work with older capabilities (e.g. if it relies on newer instructions not supported by old hardware), but the upper bound likely only means that they do not test 9.0 yet (NVIDIA dropping support for instructions shouldn't happen too often?)
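
For instance, a hypothetical overrideAttrs along these lines (assuming torch is in scope, e.g. python3Packages.torch; the exact source strings vary across pytorch releases):

```nix
# Hypothetical sketch of "patching it out": extend torch's hardcoded arch
# list at build time so 9.0 is accepted. Source strings are illustrative.
torch.overrideAttrs (old: {
  postPatch = (old.postPatch or "") + ''
    substituteInPlace torch/utils/cpp_extension.py \
      --replace "'8.0', '8.6'" "'8.0', '8.6', '9.0'"
  '';
})
```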

ConnorBaker commented 1 year ago

@SomeoneSerge if you're going to patch it, would you also consider patching the check for compiler versions?

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L51-L71

samuela commented 1 year ago

Hmm, I have a number of questions.

So I'm a little skeptical, but honestly this aspect of the codebase is not really my expertise, so ofc I'm open to hearing other perspectives.

SomeoneSerge commented 1 year ago

To speak plainly, the main source of motivation is that the fat flavours of tf and torch are so heavy that they totally confuse any schedulers.

Longer version: the motivation is to optimize for easier maintenance. When we work on master, we don't usually run our test builds with all capabilities enabled, because that's so slow. E.g. I usually build only for 8.6; in fact, I don't even need to deploy for more than 8.0 and 8.6. Anything else is extra gigabytes of storage and extra compute. When working on a package that needs torch or tf, it's nice to have them cached. I also want to make people aware that the cudaCapabilities option exists (we need to stabilize the interface first, ofc)
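
Concretely, a test build restricted to one capability might look like this (a hedged sketch; the attribute path may differ):

```nix
# Hedged sketch: a maintainer's test build restricted to one capability,
# evaluated from a local nixpkgs checkout.
(import ./. {
  config = {
    allowUnfree = true;
    cudaSupport = true;
    cudaCapabilities = [ "8.6" ];
  };
}).python3Packages.torch
```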

We already build several flavours (cuda, cuda+mkl, different sets of capabilities). What I propose is that we stop maintaining the "fat" flavour, which enables all capabilities at once. We'd still build the N smaller packages, and there are benefits to that.

The fat flavour isn't useless; in fact, it's very desirable.

Fat builds are nice, but within our limited budget I think we should prioritize the smaller builds. If someone wants a binary cache for the fat flavour, they'll have to go through the management work of gathering more resources.

UX-wise, I just want to ensure that the default people get when they copy and paste (import <nixpkgs> {}) is not to suddenly start building tensorflow, only to realize that it's not going to finish the same day

Another alternative that I like is to set cudaCapabilities to an error value that would prompt users to override the option. However, this goes against the nixpkgs policy of public attributes not emitting warnings, etc. Nixpkgs-review would need special treatment. Another similar approach: set broken = true unless the user provides explicit cudaCapabilities.
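
A minimal sketch of that error-value alternative (the wiring here is hypothetical):

```nix
# Hypothetical sketch of the "error value" alternative: evaluation aborts
# with a pointer to the option unless the user provides cudaCapabilities.
{ config }:
{
  cudaCapabilities = config.cudaCapabilities or (throw ''
    nixpkgs no longer sets a default cudaCapabilities.
    Import nixpkgs with the capabilities you need, e.g.:
      import <nixpkgs> { config.cudaCapabilities = [ "8.6" ]; }
  '');
}
```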

samuela commented 1 year ago

Ah gotcha, thanks for explaining @SomeoneSerge! I now understand better why N smaller builds are attractive vs. our one "fat" build. Indeed, it would be very nice to maintain the N builds and not concern ourselves with fat builds at all.

IIUC, what this really boils down to is UX challenges with cudaCapabilities. In an ideal world we could magically dispatch to the right 1-of-N package for the GPU at runtime, but making this a reality comes with its own UX challenges.

This is maybe a crazy idea, but is there any way we could go from N single-capability builds to a single "fat" build without building the "fat" build from source all over again? If so, that might point towards a future in which we build N single-capability builds and then have the freedom to mix and match them cheaply...

SomeoneSerge commented 1 year ago

Surfacing the cudaCapabilities option to users is tricky.

Maybe a solution will just present itself in time: https://discourse.nixos.org/t/working-group-member-search-module-system-for-packages/26574

samuela commented 1 year ago

Ah very cool, I was not aware of that initiative!