SomeoneSerge / nixpkgs-cuda-ci

Building and caching nixpkgs with cudaSupport=true. We push to https://cuda-maintainers.cachix.org/
https://hercules-ci.com/github/SomeoneSerge/nixpkgs-cuda-ci
MIT License

Clarify intended usage #31

Open RuRo opened 2 months ago

RuRo commented 2 months ago

The README seems to suggest that adding cuda-maintainers.cachix.org as a substituter and setting `allowUnfree = true` and `cudaSupport = true` is sufficient to get the prebuilt packages. However, I quite often end up rebuilding some of the CUDA-enabled packages after updating.
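For reference, the setup I'm describing is roughly this (a sketch in NixOS module syntax; the public key is elided here and must be copied from the cachix page before use):

```nix
# Sketch of the setup described above, assuming NixOS module syntax.
# The trusted public key is elided; copy the real value from
# https://cuda-maintainers.cachix.org/ before using this.
{
  nix.settings = {
    substituters = [ "https://cuda-maintainers.cachix.org" ];
    trusted-public-keys = [ "cuda-maintainers.cachix.org-1:<key>" ];
  };
  nixpkgs.config = {
    allowUnfree = true;
    cudaSupport = true;
  };
}
```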

I have a few questions:

1) I currently have nixpkgs following `github:nixos/nixpkgs/nixos-unstable` in my flake and run `nix flake update nixpkgs` every once in a while, but this seems like a bad strategy: the CI might lag behind upstream, and not every commit may be successfully built.

Is there some better way to only track the `nixos-unstable` commits that were successfully built by `nixpkgs-cuda-ci`? The README links to the [hercules dashboard](https://hercules-ci.com/github/SomeoneSerge/nixpkgs-cuda-ci), but it's not clear how to get the desired information from that dashboard. It also looks like most jobs are failing for some reason.

2) The README mentions that

We build for different CUDA architectures at different frequencies, which means that to make use of the cache you might need to import nixpkgs as e.g. `import <nixpkgs> { ...; config.cudaCapabilities = [ "8.6" ]; }`. Cf. the flake for details
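If I understand that snippet correctly, the intended usage would be something like the following (my guess at the expanded form, not verified against the flake):

```nix
# My reading of the README snippet: fix the capability set (and the other
# config flags) at import time so the derivation hashes match what the CI built.
import <nixpkgs> {
  config = {
    allowUnfree = true;
    cudaSupport = true;
    cudaCapabilities = [ "8.6" ];
  };
}
```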

What are those "different frequencies" exactly?

3) `nix/overlays.nix` also seems to optionally enable MKL versions of LAPACK/BLAS.

Are these versions of packages also built in CI and if so, how often?

So, for example, if I set `cudaCapabilities = [ "8.6" ]` and enable MKL the same way as your `nix/overlays.nix`, how can I determine the latest nixos-unstable commit that is already available in cuda-maintainers.cachix.org?
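For context, by "the same way" I mean roughly this overlay (my approximation of what the repo does, written against the standard nixpkgs `blasProvider`/`lapackProvider` mechanism):

```nix
# Approximation of the MKL part of the repo's overlay: swap the default
# BLAS/LAPACK implementations for MKL via the provider override hooks.
final: prev: {
  blas = prev.blas.override { blasProvider = final.mkl; };
  lapack = prev.lapack.override { lapackProvider = final.mkl; };
}
```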

zopieux commented 1 week ago

I believe you've hit the nail on the head; I also struggle to answer these exact questions, which makes using the cache an exercise in frustration.

I have had some success asking around the NixOS CUDA Matrix room, but figuring out which package is successfully included in which nixpkgs-revision CI build seems basically impossible without some insider know-how. On a few occasions SomeoneSerge was kind enough to point me to the exact build that included what I needed, but I have yet to reverse-engineer how an end-user can do this without bothering the maintainers.

SomeoneSerge commented 4 days ago

Hi! Sorry about the frustration. I've been spending less and less time on this repo. My idea for the next steps is roughly this:

  1. Confirm with the nix-community infra people that they're ready to publicly advertise their support of CUDA
  2. Update the readme to point to https://hydra.nix-community.org/jobset/nixpkgs/cuda and https://nix-community.org/cache/
  3. Archive the present repo

P.S. @RuRo Sorry for the delayed response, I actually didn't get a notification for this issue o_0

RuRo commented 4 days ago

This sounds like a great development!

I still have one question, though: if/when this repo gets archived, what would be the appropriate place to discuss or report issues with the new nix-community CUDA cache/builders? For example, a lot of the questions in my original post would also apply to the nix-community cache.

Thanks.

SomeoneSerge commented 4 days ago

One venue would be #nix-community:nixos.org on Matrix, paralleled by https://github.com/nix-community/infra/issues on GitHub. The nix-community Hydra follows the nixos-unstable branch and builds its pkgs/top-level/release-cuda.nix file. That's where the list of packages and capabilities is controlled; currently it only features the "all caps" variant for x86_64 and the "all caps" sbsa (not Jetson) variant for aarch64. This can be adjusted by opening a PR against Nixpkgs, but in coordination with the nix-community team, because such changes can dramatically increase the load on the community Hydra's build servers, which are shared with projects other than the CUDA cache.
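Concretely, pointing at the community cache looks roughly like this (a sketch; the public key is elided and should be taken from https://nix-community.org/cache/):

```nix
# Sketch: substitute from the nix-community cache (NixOS module syntax).
# The public key is elided; copy the real value from
# https://nix-community.org/cache/ before using this.
{
  nix.settings = {
    substituters = [ "https://nix-community.cachix.org" ];
    trusted-public-keys = [ "nix-community.cachix.org-1:<key>" ];
  };
}
```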

SomeoneSerge commented 4 days ago

Follow the links in https://github.com/NixOS/nixpkgs/pull/324379

SomeoneSerge commented 4 days ago

Also to answer the original questions, even though that's less relevant now:

`nix/overlays.nix` seems to also be optionally enabling MKL versions of LAPACK/BLAS.

Two ideas wrt the overlays were 1) to test non-default instances of packages (e.g. MPI or MKL support that was otherwise disabled), and 2) to provide executable instructions on how to get a cache hit (a matching hash) when enabling these optional features, since that's kind of like looking for a needle in a haystack...

Most parts of the overlays were merged into nixpkgs over time (some guarded behind `config.cudaSupport`), so the overlays became less relevant.
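The "guarded behind `config.cudaSupport`" pattern looks roughly like this in a package expression (an illustrative fragment, not a real derivation from nixpkgs):

```nix
# Illustration only: the CUDA variant is selected by the global config flag,
# so setting config.cudaSupport = true yields the same derivation the CI builds.
{ lib, config, cudaSupport ? config.cudaSupport, cudaPackages ? { } }:

{
  # CUDA dependencies are only added when cudaSupport is enabled.
  buildInputs = lib.optionals cudaSupport [ cudaPackages.cudatoolkit ];
}
```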

What are those "different frequencies" exactly?

That used to be specified like so: https://github.com/SomeoneSerge/nixpkgs-cuda-ci/pull/14/files#diff-206b9ce276ab5971a2489d75eb1b12999d4bf3843b7988cbe8d687cfde61dea0L170

But then the onSchedule jobs were disabled because Hercules kept accumulating pending effects without ever running any, requiring the queue to be reset. Currently there's just a GitHub Action that updates the lock file from time to time and triggers the default job...