Adding NVIDIA GPU support

ocaisa commented 1 year ago

There have been a number of issues and PRs to date related to this, but we now need to get this in order and bring all those efforts up to date. Here's the updated task list for supporting NVIDIA GPUs:

[x] Support CUDA installations under host_injections subdirectory with the build bot (and for end users). WIP with https://github.com/EESSI/software-layer/pull/368
[x] Support "standard" CUDA installation, but with a hook to strip out everything not in the runtime and replace it with symlinks to the CUDA installation under host_injections (WIP with #381)
[x] Install CUDAsamples to verify CUDA compilation with this approach (WIP with #381)
[ ] Support driver libraries in multiple locations (via CUDA compat libraries, via links to host libraries, and under /.singularity.d/libs so our linker also works within containers). This requires updates to the ld.config that we ship for our linker. The relevant libraries are listed within https://github.com/apptainer/apptainer/blob/main/etc/nvliblist.conf
- [ ] Add p7zip to support unpacking RPMs (optional now that we have permission to ship the CUDA compatibility libraries under the CUDA EULA)
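
As a rough illustration of the "multiple locations" item above, the sketch below probes an ordered list of directories for a driver library and returns the first hit. The function name `find_driver_library` and the directory names in the demo are hypothetical stand-ins (only /.singularity.d/libs is named in the task list), and the real lookup would be done by the linker via ld.config rather than in Python.

```python
import tempfile
from pathlib import Path


def find_driver_library(libname, search_dirs):
    """Return the first existing path to libname in search_dirs, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / libname
        if candidate.exists():
            return candidate
    return None


def demo():
    # Temporary directories stand in for the three kinds of locations:
    # CUDA compat libraries, links to host libraries, container-provided libs.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        dirs = [root / "compat", root / "host_libs", root / "singularity_libs"]
        for d in dirs:
            d.mkdir()
        # The library is present in the second and third location only.
        (dirs[1] / "libcuda.so.1").write_text("stub")
        (dirs[2] / "libcuda.so.1").write_text("stub")
        found = find_driver_library("libcuda.so.1", dirs)
        missing = find_driver_library("libnvidia-ml.so.1", dirs)
        return found.parent.name, missing
```

The order of `search_dirs` decides which copy wins when a library exists in more than one place, which is the same question the ld.config ordering would have to settle.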

Sabryr commented 1 year ago

Hello @ocaisa, thank you very much for your effort. We had a discussion at the University of Oslo with @terjekv and a few others. Do you have a summary of the restrictions we face when distributing NVIDIA libraries, especially the CUDA runtime? We have a meeting with some senior NVIDIA people and can bring this to their attention.

ocaisa commented 1 year ago

We've already had a discussion with them around this. We have a specific plan here: we parse the EULA to figure out what we can ship, and we strip out everything else, replacing it with a symlink to a special location. We assume that what is listed in the EULA is sufficient for the runtime (and that seems to be the case so far). For other cases (like when using the CUDA compiler), we have a script that reinstalls CUDA in that special location, unbreaking all the symlinks. It might be a little clearer with the PR I hope to make today.

ocaisa commented 1 year ago

When the symlinks are unbroken, there is no difference from a typical installation (except that the non-runtime parts are actually local).
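
A minimal sketch of the strip-and-symlink idea, for illustration only: it is not the actual hook, and the function name `strip_non_whitelisted`, the matching on file basenames, and the directory layout in the demo are all assumptions. Files that are not whitelisted are replaced by symlinks into a second location; the links dangle until something is installed there.

```python
import tempfile
from pathlib import Path


def strip_non_whitelisted(install_dir, host_dir, whitelist):
    """Replace files whose name is not in whitelist by symlinks into host_dir.

    Returns the relative paths (POSIX style) that were replaced.
    """
    install_dir, host_dir = Path(install_dir), Path(host_dir)
    replaced = []
    # sorted() materializes the listing before we start modifying the tree
    for path in sorted(install_dir.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        if path.name in whitelist:  # assumption: whitelist holds basenames
            continue
        rel = path.relative_to(install_dir)
        path.unlink()
        path.symlink_to(host_dir / rel)  # dangling until host_dir is populated
        replaced.append(rel.as_posix())
    return replaced


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        install = Path(tmp) / "software" / "CUDA"
        host = Path(tmp) / "host_injections" / "CUDA"
        (install / "lib64").mkdir(parents=True)
        (install / "bin").mkdir()
        (install / "lib64" / "libcudart.so").write_text("runtime")
        (install / "bin" / "nvcc").write_text("compiler")

        replaced = strip_non_whitelisted(install, host, {"libcudart.so"})
        nvcc = install / "bin" / "nvcc"
        broken_before = nvcc.is_symlink() and not nvcc.exists()

        # Simulate the later installation under host_injections,
        # which "unbreaks" the symlink.
        (host / "bin").mkdir(parents=True)
        (host / "bin" / "nvcc").write_text("compiler")
        works_after = nvcc.read_text() == "compiler"
        return replaced, broken_before, works_after
```

In the demo, the whitelisted runtime library stays a regular file, while `bin/nvcc` becomes a link that only resolves once the second location has been populated.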

boegel commented 11 months ago

Some progress here:

PR #410 was merged, which is an additional step towards supporting GPUs in software.eessi.io;
- this PR introduced some problems/mistakes, some of which are in the `` script;
PR #434 is a follow-up of that PR, which:
- fixes the problems introduced in #410;
- fixes a couple of additional problems that popped up during testing;
- should allow us to get to a point really soon where:
  - we have CUDA (v12.1.1) installed in software.eessi.io, in which files that are not in the EULA whitelist have been stripped out and replaced by symbolic links into host_injections;
  - we have CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1 installed for all CPU targets;
TODO before merging #434:
- [x] make sure that the installation of CUDA in both host_injections and in software layer + installation of CUDA-Samples on top works;
- [x] make some minor changes in the post_sanitycheck_cuda hook, mainly to get better logging on what gets included in the CUDA installation, and what is stripped out because it's not whitelisted;
- [x] build CUDA-Samples for all CPU targets, deploy those installations in software.eessi.io/versions/2023.06, and merge the PR
TODO after merging #434:
- [x] decide on location to ship gpu_support scripts in EESSI repository (.../versions/2023.06/scripts/gpu_support?) + make necessary changes in follow-up PR (already done in #434);
- [x] update https://eessi.io/docs/gpu (done, see https://github.com/EESSI/docs/pull/138)
- [ ] create node images for aarch64 and x86_64 that include GPU drivers;
- [ ] re-configure bot to support node types with a GPU (cfr. https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing);
- [ ] figure out how to deal with not having a GPU instance available for all CPU targets (like neoverse_v1);

EESSI / software-layer

Adding NVIDIA GPU support #375