ComputeCanada / puppet-magic_castle

Puppet Environment repo for Magic Castle - https://github.com/ComputeCanada/magic_castle
MIT License

NVIDIA driver installation #374

Closed SebastianAchilles closed 3 months ago

SebastianAchilles commented 3 months ago

When I try to create a compute instance with an NVIDIA GPU on Rocky 9.4, I get the following error:

Execution of '/usr/bin/dnf -d 0 -e 1 -y install kmod-nvidia-latest-dkms' returned 1: Error: Unable to find a match: kmod-nvidia-latest-dkms
(/Stage[main]/Profile::Gpu::Install::Passthrough/Package[kmod-nvidia-latest-dkms]/ensure) change from 'purged' to 'present' failed: Execution of '/usr/bin/dnf -d 0 -e 1 -y install kmod-nvidia-latest-dkms' returned 1: Error: Unable to find a match: kmod-nvidia-latest-dkms

So the package kmod-nvidia-latest-dkms, specified at https://github.com/ComputeCanada/puppet-magic_castle/blob/main/data/common.yaml#L297, was not found. I think this happens because the default stream changed from latest-dkms to open-dkms:

$ dnf module list nvidia-driver

Name                          Stream                           Profiles                                 Summary                                             
nvidia-driver                 latest                           default [d], fm, ks, src                 Nvidia driver for latest branch                     
nvidia-driver                 latest-dkms                      default [d], fm, ks                      Nvidia driver for latest-dkms branch                
nvidia-driver                 open-dkms [d][e]                 default [d], fm, ks, src                 Nvidia driver for open-dkms branch

The latest-dkms stream needs to be enabled in order to be able to install kmod-nvidia-latest-dkms.
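As a quick check, the enabled stream can be read off the `[e]` flag in that listing. This is a minimal sketch that assumes the column layout shown above; `enabled_nvidia_stream` is a hypothetical helper name, not something shipped with Magic Castle.

```shell
# Sketch: print the nvidia-driver stream(s) flagged [e] (enabled).
# Reads `dnf module list` output from stdin and assumes the layout
# "<name> <stream> [flags] <profiles> ...", as in the listing above.
enabled_nvidia_stream() {
  awk '$1 == "nvidia-driver" && $3 ~ /\[e\]/ { print $2 }'
}
```

It would be used as `dnf module list nvidia-driver | enabled_nvidia_stream`. On the node above this should report open-dkms, which is why kmod-nvidia-latest-dkms cannot be resolved.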

I see two solutions:

  1. switch to the open-dkms stream
  2. enable the latest-dkms stream
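
Done by hand, the two options might look roughly like the sketch below. The function names and the overridable `DNF` variable are hypothetical, and the package name `kmod-nvidia-open-dkms` is an assumption made by analogy with `kmod-nvidia-latest-dkms`. This is not what the Puppet module itself runs.

```shell
# Sketch only. DNF can be overridden (e.g. with a stub) to preview the commands.
DNF="${DNF:-dnf}"

# Option 1: stay on the open-dkms stream and install its kernel module package.
# ASSUMPTION: the package is named kmod-nvidia-open-dkms.
use_open_dkms() {
  "$DNF" -y module enable nvidia-driver:open-dkms
  "$DNF" -y install kmod-nvidia-open-dkms
}

# Option 2: reset the module, enable latest-dkms, then install the
# package that the failing Puppet run was looking for.
use_latest_dkms() {
  "$DNF" -y module reset nvidia-driver
  "$DNF" -y module enable nvidia-driver:latest-dkms
  "$DNF" -y install kmod-nvidia-latest-dkms
}
```

The reset in option 2 is there because open-dkms is already enabled on this node, and dnf will not switch an enabled stream without it.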

cmd-ntrf commented 3 months ago

Thank you for reporting this.

We are working on a fix as we speak https://github.com/ComputeCanada/puppet-magic_castle/pull/373.

cmd-ntrf commented 3 months ago

The issue should now be fixed in the main branch. We will work on releasing by the end of the week.

SebastianAchilles commented 3 months ago

Thanks a lot for the quick answer. I have tested https://github.com/ComputeCanada/puppet-magic_castle/pull/373 and installing the NVIDIA driver worked.

cmd-ntrf commented 3 months ago

Apparently the bug is not entirely fixed. Puppet tries to enable the nvidia-driver stream on every run.

Aug 23 12:39:01 nodegpu1.int.xxxxxx puppet-agent[28499]: Requesting catalog from mgmt1:8140 (172.16.5.58)
Aug 23 12:39:09 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Gpu::Install::Passthrough/Package[nvidia-stream]/ensure) created (corrective)
Aug 23 12:39:14 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Gpu::Install/Exec[nvidia-symlink]) Triggered 'refresh' from 1 event
Aug 23 12:39:14 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Slurm::Node/Exec[slurm-nvidia_gres]) Triggered 'refresh' from 1 event
Aug 23 12:39:15 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Slurm::Node/Service[slurmd]) Triggered 'refresh' from 1 event
Aug 23 12:39:15 nodegpu1.int.xxxxxx puppet-agent[28499]: Applied catalog in 11.11 seconds

I suspect this issue is caused by the naming of the nvidia-driver stream package.

We are working on a fix.
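
To check whether a node is affected, one option is to count how often the stream resource shows up as a corrective change in the agent log. This is a minimal sketch. `count_stream_corrections` is a hypothetical helper, and the match string is taken from the log excerpt above.

```shell
# Sketch: count "(corrective)" events for Package[nvidia-stream]
# in Puppet agent log text read from stdin.
# grep -c exits non-zero when the count is 0, hence the `|| true`.
count_stream_corrections() {
  grep -cF 'Package[nvidia-stream]/ensure) created (corrective)' || true
}
```

Assuming the agent logs to journald under a unit named `puppet`, this could be run as `journalctl -u puppet | count_stream_corrections`. A count that grows by one per agent run would match the behaviour described above.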

cmd-ntrf commented 3 months ago

Should now be fixed.