SomeoneSerge opened this issue 1 year ago
[ "8.6" "7.0+PTX" ]
work? Should we consider them?Potentially related: https://github.com/NixOS/nixpkgs/pull/220366#discussion_r1135048161
I think we want to default to the newest available capability. Users who need lower caps will encounter an error (rather than a cache miss; thinking of cachix), admittedly after a substantial download. The error is also likely going to look confusing, but hopefully people would discover the change through GitHub and Discourse. We'd document the change and suggest they import nixpkgs with the capability they need. We'd build the cache for all capabilities separately.
CC @ConnorBaker @samuela
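To make the "import nixpkgs with the capability they need" part concrete, something along these lines (values here are only examples, not a recommendation):

```nix
# Minimal sketch: a user on older hardware requests the capability they need
# instead of relying on the (proposed) newest-capability default.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true;            # CUDA packages are unfree
      cudaSupport = true;
      cudaCapabilities = [ "7.0" ];  # example value: Volta
    };
  };
in
pkgs.python3Packages.torch
```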
Newest as of 11.8 is Hopper (9.0).
The way it's set up currently, we should be able to do that by taking the last element of the supported capabilities, or something similar, right?
EDIT: Should we default to versions that cause breakages, though? For example, the current version of torch won't work with anything newer than 8.6 (IIRC). Although, if packages were aware of what we were requesting and picked the newest architecture they could support instead of just erroring, I guess that would be easier?
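Roughly, the "take the last element" idea could look like this (assuming the supported-capabilities list is kept sorted in ascending order; the list below is example data, not the actual source of truth):

```nix
# Rough sketch: default to the newest capability by taking the last element
# of an ascending-sorted list of supported capabilities.
{ lib ? (import <nixpkgs> { }).lib }:
let
  # Example data only; not the actual list maintained in nixpkgs.
  supportedCapabilities = [ "6.0" "7.0" "7.5" "8.0" "8.6" "9.0" ];
in
{
  # lib.last picks the newest capability, assuming ascending order.
  defaultCudaCapabilities = [ (lib.last supportedCapabilities) ];  # => [ "9.0" ]
}
```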
current version of torch won't work with anything newer than 8.6
o_0 why?
Oh, you know, just your standard hardcoded support 🙃
https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1751-L1753
https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1791-L1793
EDIT: Two paths I can think of:
Interested in your thoughts!
Patch it out (and whatever else is touched by that) so we can build for arbitrary architecture
This is more or less our default approach. Clearly, pytorch may not work with older capabilities (e.g. pytorch relying on newer instructions not supported by old hw), but the upper bound likely only means that they do not test 9.0 yet (nvidia dropping support for instructions shouldn't happen too often?)
@SomeoneSerge if you're going to patch it, would you also consider patching the check for compiler versions?
https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L51-L71
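Not proposing the exact patch here, but the shape of the override could be roughly this (the patch file names are hypothetical placeholders):

```nix
# Hypothetical sketch, not the actual nixpkgs change: carry patches that relax
# torch's hardcoded CUDA-arch upper bound and its compiler-version check.
let
  pkgs = import <nixpkgs> {
    config = { allowUnfree = true; cudaSupport = true; };
  };
in
pkgs.python3Packages.torch.overridePythonAttrs (old: {
  patches = (old.patches or [ ]) ++ [
    ./relax-cuda-arch-list.patch          # hypothetical patch file
    ./relax-compiler-version-check.patch  # hypothetical patch file
  ];
})
```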
Hmm, I have a number of questions. Right now the flavors are essentially cudaSupport = false and = true. IIUC this suggestion would result in us having to support 1 + N flavors, since there may be drift between master, what maintainers build with, and what is cached on cuda-maintainers.cachix.org. Is that correct? How many possible variations are we looking at?

So I'm a little skeptical, but honestly this aspect of the codebase is not really my expertise, so of course I'm open to hearing other perspectives.
To speak plainly, the main source of motivation is that the fat flavours of tf and torch are so heavy, they totally confuse any schedulers.
Longer version: the motivation is to optimize for easier maintenance. When we work on master, we don't usually run our test builds with all capabilities on, because it's so slow. E.g. I usually build only for 8.6. In fact, I don't even need to deploy for more than 8.0 and 8.6. Anything else is extra gigabytes of storage and extra compute. When working on a package that needs torch or tf, it's nice to have them cached. I also want to make people aware that the cudaCapabilities option exists (we need to stabilize the interface first ofc).
We already build several flavours (cuda, cuda+mkl, different sets of capabilities). What I propose is that we stop maintaining the "fat" flavour, which sets all caps on. We'd still build N smaller packages. There are benefits to that:
The fat flavour isn't useless. In fact, it's very desirable:
Fat builds are nice, but within our limited budget I think we should prioritize the smaller builds. If someone wants a binary cache for the fat flavour, there's the management work of gathering more resources they'll have to go through.
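To make "N smaller packages" a bit more concrete, the per-capability instantiations could look roughly like this (the capability list and package choice are just examples):

```nix
# Rough sketch of N per-capability package sets that a binary cache could
# build instead of one fat build.
let
  lib = (import <nixpkgs> { }).lib;
  # Example capability set; the real list would be whatever the cache targets.
  capabilities = [ "7.0" "7.5" "8.0" "8.6" ];
  pkgsFor = cap: import <nixpkgs> {
    config = {
      allowUnfree = true;
      cudaSupport = true;
      cudaCapabilities = [ cap ];
    };
  };
in
# One attribute per capability, e.g. the "8.6"-only torch build.
lib.genAttrs capabilities (cap: (pkgsFor cap).python3Packages.torch)
```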
UX-wise I just want to ensure that the default that people see when they copy and paste with (import <nixpkgs> {}); is not suddenly building tensorflow, only to realize that it's not going to finish the same day.
Another alternative that I like is to set cudaCapabilities to an error value that would prompt users to override the option. However, this goes against the nixpkgs policy of public attributes not showing warnings etc. Nixpkgs-review would need special treatment. Another similar approach: set broken = true unless explicit cudaCapabilities are provided by the user.
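A sketch of the "error value" variant, with naming and wording purely illustrative:

```nix
# Illustrative sketch of the "error value" idea: evaluation fails with a hint
# only when something actually forces the capability list.
{ config ? { } }:
let
  # The throw only fires when the list is forced, which is also why
  # nixpkgs-review and other evaluate-everything tools would need special treatment.
  cudaCapabilities = config.cudaCapabilities or (throw ''
    cudaSupport is enabled but no cudaCapabilities were requested.
    Pick the capabilities you need, e.g. cudaCapabilities = [ "8.6" ].
  '');
in
{ inherit cudaCapabilities; }
```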
Ah gotcha, thanks for explaining @SomeoneSerge! I understand better now why N smaller builds are attractive vs our 1 "fat" build. Indeed, it would be very nice to maintain N builds and not concern ourselves with fat builds at all.
IIUC what this really boils down to is UX challenges with cudaCapabilities. In an ideal world we could magically dispatch to the right 1-of-N package for the GPU at runtime. But making this a reality comes with UX challenges: surfacing the cudaCapabilities option to users is tricky. Defaulting to warnings, build failures, or meta.broken status unless configured otherwise is unorthodox and sub-optimal UX. Defaulting to all capabilities is easy UX but bad for build times.

This is maybe a crazy idea, but is there any way in which we could go from N single-capability builds to a single "fat" build, without building the "fat" build from source all over again? If so, that might point towards a future in which we build N single-capability builds and then have the freedom to mix and match them cheaply...
Surfacing the cudaCapabilities option to users is tricky.
Maybe a solution will just present itself in time https://discourse.nixos.org/t/working-group-member-search-module-system-for-packages/26574
Ah very cool, I was not aware of that initiative!
Subgoals:

- config.cudaCapabilities
- cudaCapabilities list possible
- cudaCapabilities list when working on master? 7.5+PTX
- build cudaCapabilities
- do not fail to discover config.cudaCapabilities

Steps:

- cudaCapabilities