ComputeCanada / puppet-magic_castle

Puppet Environment repo for Magic Castle - https://github.com/ComputeCanada/magic_castle
MIT License

GPU support with the EESSI stack #301

Open ocaisa opened 9 months ago

ocaisa commented 9 months ago

For EESSI, we implemented GPU support for the stack. To access the drivers, someone has to run the script https://github.com/EESSI/software-layer/blob/2023.06-software.eessi.io/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh, which does the symlinking while allowing for the potential use of CUDA compatibility libraries. It also places (a symlink to) the libraries in a trusted location for the Gentoo Prefix linker.

I'm not sure what the best approach is here: for the mentioned script to work, you already need a working CVMFS install. Could there be support for a post-install script for CVMFS?

In the next EESSI release we plan to add additional trusted locations for the linker (right now there is only one):

which would mean that this script would change a little, so rather than reproduce what it does, I'd like to be able to call it directly.

cmd-ntrf commented 9 months ago

> I'm not sure what the best approach here is, for the mentioned script to work, you already need a working CVMFS install. Could there be support for a post install script for CVMFS?

There is already an example of an exec resource that runs after CVMFS is installed: https://github.com/ComputeCanada/puppet-magic_castle/blob/main/site/profile/manifests/cvmfs.pp#L129

We could do the same for EESSI's script. Should the script run only on nodes with a GPU, or should we always run it?

ocaisa commented 9 months ago

Right now, the script errors out if it cannot successfully execute:

nvidia-smi --query-gpu=driver_version --format=csv,noheader

so, as it currently is, it should only run on nodes with a GPU. It also currently requires that you initialise EESSI.
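
To illustrate the GPU-only condition, here is a minimal sketch of a guard that a caller could put in front of the script. It is not part of the EESSI script or of Magic Castle. The function names `has_nvidia_driver` and `run_if_gpu` and the `NVIDIA_SMI` override variable are hypothetical and exist only for this example; the only thing taken from the script's behaviour is the `nvidia-smi` query shown above.

```shell
# Sketch only: run a GPU setup script when the NVIDIA driver query succeeds.

# Returns 0 if the driver query succeeds, non-zero otherwise.
# NVIDIA_SMI is a hypothetical override so the check can be stubbed out.
has_nvidia_driver() {
    local smi="${NVIDIA_SMI:-nvidia-smi}"
    # The tool must exist before we try to query it.
    command -v "$smi" >/dev/null 2>&1 || return 1
    "$smi" --query-gpu=driver_version --format=csv,noheader >/dev/null 2>&1
}

# Runs the given script on GPU nodes and passes its exit status through.
# On nodes without a usable driver it skips the script and reports success.
run_if_gpu() {
    local script="$1"
    if has_nvidia_driver; then
        "$script"
    else
        echo "No usable NVIDIA driver found; skipping ${script}" >&2
        return 0
    fi
}
```

A caller would pass the path of the EESSI script as the argument to `run_if_gpu`. The same query could equally serve as the condition of a Puppet exec resource, so that nodes without a GPU never attempt the run.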

Now that I've seen that both of these can cause problems, I think I'd like to add options to not throw errors and to not require that EESSI is initialised (since I know the path of the script being called, I know the version of EESSI, so initialisation is not actually necessary).
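
As a rough sketch of the idea that the version can be read off the script's own path, the helper below extracts the path component that follows `versions/`. The function name `eessi_version_from_path` is hypothetical, and the `.../versions/<version>/scripts/...` layout is an assumption made for this example, not something confirmed by the script itself.

```shell
# Sketch only: derive the EESSI version from a script path.
# Assumes (hypothetically) a layout of .../versions/<version>/scripts/...
eessi_version_from_path() {
    local path="$1"
    # Bail out if the path does not contain a versions/<something>/ component.
    case "$path" in
        */versions/*/*) ;;
        *) return 1 ;;
    esac
    # Drop everything up to and including the first "/versions/".
    local rest="${path#*/versions/}"
    # Keep only the first remaining path component.
    printf '%s\n' "${rest%%/*}"
}
```

With a helper like this, the script could set the version from its own location instead of relying on an initialised EESSI environment, and fall back to the current behaviour when the path does not match.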