NixOS / nixpkgs

Change default nixpkgs' cudaCapabilities #221564

Open SomeoneSerge opened 1 year ago

SomeoneSerge commented 1 year ago

Subgoals:

  1. cuda-maintainers.cachix.org caches nixpkgs imported without a custom config.cudaCapabilities
  2. cuda-maintainers.cachix.org caches master at a higher frequency, but with the shortest cudaCapabilities list possible
  3. Builds are "reasonably cheap"
  4. CUDA maintainers use the same short cudaCapabilities list when working on master?
  5. Users are unlikely to accidentally use packages built with the wrong capabilities, e.g. A100 users don't end up using a 7.5+PTX build
  6. Users who, after the change, need to override cudaCapabilities do not fail to discover config.cudaCapabilities

Steps:

  1. [ ] Who uses CUDA in nixpkgs and how? A NixOS Discourse post with a multiple choice poll, listing possible combinations of cudaCapabilities.
    1. [ ] Ask about MKL and alternative MPI implementations
    2. [ ] Link to it from https://matrix.to/#/#datascience:nixos.org
    3. [ ] Reference https://github.com/Nix-QChem/NixOS-QChem users
    4. [ ] Reference https://github.com/numtide/nixpkgs-unfree
    5. [ ] If there's a new nixpkgs-unfree instance from Domen Kozar, reference that
    6. [ ] Link https://discourse.nixos.org/t/improving-nixos-data-science-infrastructure-ci-for-mkl-cuda/5074/11
    7. [ ] Link https://discourse.nixos.org/t/why-is-the-nix-compiled-python-slower/18717/20?u=sergek
    8. [ ] Link https://discourse.nixos.org/t/petition-to-build-and-cache-unfree-packages-on-cache-nixos-org/17440
ConnorBaker commented 1 year ago

Potentially related: https://github.com/NixOS/nixpkgs/pull/220366#discussion_r1135048161

SomeoneSerge commented 1 year ago

I think we want to default to the newest available capability. Users who need lower capabilities will encounter an error (rather than a cache miss; thinking of cachix), admittedly after a substantial download. The error is also likely to look confusing, but hopefully people will discover the change through GitHub and Discourse. We'd document the change and suggest they import nixpkgs with the capability they need. We'd build caches for all capabilities separately
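
For illustration, a minimal sketch of the kind of import the docs would suggest (the capability value and the extra config flags here are just examples):

```nix
# Hedged sketch: import nixpkgs with an explicit capability list instead of
# relying on the default. "8.6" is only an example (Ampere consumer GPUs).
import <nixpkgs> {
  config = {
    allowUnfree = true;        # CUDA packages are unfree
    cudaSupport = true;
    cudaCapabilities = [ "8.6" ];
  };
}
```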

CC @ConnorBaker @samuela

ConnorBaker commented 1 year ago

The newest as of CUDA 11.8 is Hopper (9.0).

The way it's set up currently, we should be able to do that by taking the last element of the supported capabilities list, or something similar, right?
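
Something like this sketch, assuming the list is kept sorted ascending (both the ordering and the attribute path are assumptions about the current layout):

```nix
# Hedged sketch: pick the newest capability by taking the last element.
# Assumes cudaFlags.cudaCapabilities is sorted ascending; the attribute
# path reflects the current layout and may change.
{ lib, cudaPackages }:
lib.last cudaPackages.cudaFlags.cudaCapabilities   # e.g. "9.0" on CUDA 11.8
```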

EDIT: Should we really choose a default that causes breakages? For example, the current version of torch won't work with anything newer than 8.6 (IIRC). Although, if packages were aware of the capabilities being requested and picked the newest architecture they support instead of just erroring, I guess that would be easier?

SomeoneSerge commented 1 year ago

current version of torch won't work with anything newer than 8.6

o_0 why?

ConnorBaker commented 1 year ago

o_0 why?

Oh, you know, just your standard hardcoded support 🙃

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1751-L1753

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1791-L1793

EDIT: Two paths I can think of:

  1. Patch it out (and whatever else is touched by that) so we can build for arbitrary architectures.
  2. Have cudaPackages.cudaFlags provide a function which, given the list of capabilities a package supports, selects the one closest to the requested target architecture (a sketch follows below).
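
A hypothetical sketch of what such a helper could look like; the function name and its placement are made up for illustration:

```nix
# Hypothetical helper for path 2 (nothing like this exists in cudaFlags yet):
# given the capabilities a package supports, return the newest one that does
# not exceed the requested target capability, or null if none qualifies.
{ lib }:
{
  closestSupportedCapability = target: supported:
    let
      # Capability strings like "8.6" compare correctly with versionOlder.
      candidates = builtins.filter
        (cap: !(lib.versionOlder target cap))   # keep caps <= target
        supported;
    in
      if candidates == [ ] then null
      else lib.last (lib.sort lib.versionOlder candidates);
}

# Example: closestSupportedCapability "9.0" [ "7.5" "8.0" "8.6" ]  =>  "8.6"
```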

Interested in your thoughts!

SomeoneSerge commented 1 year ago

Patch it out (and whatever else is touched by that) so we can build for arbitrary architectures

This is more or less our default approach. Clearly, pytorch may not work with older capabilities (e.g. if it relies on newer instructions not supported by old hardware), but the upper bound likely only means that they do not test 9.0 yet (NVIDIA dropping support for instructions shouldn't happen too often?)
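
For instance, a hypothetical overrideAttrs along these lines (assuming torch is in scope, e.g. python3Packages.torch; the exact source strings vary across pytorch releases):

```nix
# Hypothetical sketch of "patching it out": extend torch's hardcoded arch
# list at build time so 9.0 is accepted. Source strings are illustrative.
torch.overrideAttrs (old: {
  postPatch = (old.postPatch or "") + ''
    substituteInPlace torch/utils/cpp_extension.py \
      --replace "'8.0', '8.6'" "'8.0', '8.6', '9.0'"
  '';
})
```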

ConnorBaker commented 1 year ago

@SomeoneSerge if you're going to patch it, would you also consider patching the check for compiler versions?

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L51-L71

samuela commented 1 year ago

Hmm, I have a number of questions.

So I'm a little skeptical, but honestly this aspect of the codebase is not really my expertise, so ofc I'm open to hearing other perspectives.

SomeoneSerge commented 1 year ago

To speak plainly, the main source of motivation is that the fat flavours of tf and torch are so heavy that they totally confuse any schedulers.

Longer version: the motivation is to optimize for easier maintenance. When we work on master, we don't usually run our test builds with all capabilities enabled, because that's so slow. E.g. I usually build only for 8.6; in fact, I don't even need to deploy for more than 8.0 and 8.6. Anything else is extra gigabytes of storage and extra compute. When working on a package that needs torch or tf, it's nice to have them cached. I also want to make people aware that the cudaCapabilities option exists (we need to stabilize the interface first, ofc)
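
Concretely, a test build restricted to one capability might look like this (a hedged sketch; the attribute path may differ):

```nix
# Hedged sketch: a maintainer's test build restricted to one capability,
# evaluated from a local nixpkgs checkout.
(import ./. {
  config = {
    allowUnfree = true;
    cudaSupport = true;
    cudaCapabilities = [ "8.6" ];
  };
}).python3Packages.torch
```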

We already build several flavours (cuda, cuda+mkl, different sets of capabilities). What I propose is that we stop maintaining the "fat" flavour, which enables all capabilities at once. We'd still build the N smaller packages, and there are benefits to that.

The fat flavour isn't useless; in fact, it's very desirable.

Fat builds are nice, but within our limited budget I think we should prioritize the smaller builds. If someone wants a binary cache for the fat flavour, they'll have to go through the management work of gathering more resources.

UX-wise, I just want to ensure that the default people get when they copy and paste (import <nixpkgs> {}) is not to suddenly start building tensorflow, only to realize that it's not going to finish the same day

Another alternative that I like is to set cudaCapabilities to an error value that would prompt users to override the option. However, this goes against the nixpkgs policy of public attributes not emitting warnings, etc. Nixpkgs-review would need special treatment. Another similar approach: set broken = true unless the user provides explicit cudaCapabilities.
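
A minimal sketch of that error-value alternative (the wiring here is hypothetical):

```nix
# Hypothetical sketch of the "error value" alternative: evaluation aborts
# with a pointer to the option unless the user provides cudaCapabilities.
{ config }:
{
  cudaCapabilities = config.cudaCapabilities or (throw ''
    nixpkgs no longer sets a default cudaCapabilities.
    Import nixpkgs with the capabilities you need, e.g.:
      import <nixpkgs> { config.cudaCapabilities = [ "8.6" ]; }
  '');
}
```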

samuela commented 1 year ago

Ah gotcha, thanks for explaining @SomeoneSerge! I now understand better why N smaller builds are attractive vs. our one "fat" build. Indeed, it would be very nice to maintain the N builds and not concern ourselves with fat builds at all.

IIUC, what this really boils down to is UX challenges with cudaCapabilities. In an ideal world we could magically dispatch to the right 1-of-N package for the GPU at runtime, but making this a reality comes with its own UX challenges.

This is maybe a crazy idea, but is there any way we could go from N single-capability builds to a single "fat" build without building the "fat" build from source all over again? If so, that might point towards a future in which we build N single-capability builds and then have the freedom to mix and match them cheaply...

SomeoneSerge commented 1 year ago

Surfacing the cudaCapabilities option to users is tricky.

Maybe a solution will just present itself in time: https://discourse.nixos.org/t/working-group-member-search-module-system-for-packages/26574

samuela commented 1 year ago

Ah very cool, I was not aware of that initiative!