Closed by SebastianAchilles 3 months ago
Thank you for reporting this.
We are working on a fix as we speak: https://github.com/ComputeCanada/puppet-magic_castle/pull/373
The issue should now be fixed in the main branch. We will work on releasing by the end of the week.
Thanks a lot for the quick answer. I have tested https://github.com/ComputeCanada/puppet-magic_castle/pull/373 and installing the NVIDIA driver worked.
Apparently the bug is not entirely fixed: Puppet tries to enable the nvidia-driver stream on every run.
Aug 23 12:39:01 nodegpu1.int.xxxxxx puppet-agent[28499]: Requesting catalog from mgmt1:8140 (172.16.5.58)
Aug 23 12:39:09 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Gpu::Install::Passthrough/Package[nvidia-stream]/ensure) created (corrective)
Aug 23 12:39:14 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Gpu::Install/Exec[nvidia-symlink]) Triggered 'refresh' from 1 event
Aug 23 12:39:14 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Slurm::Node/Exec[slurm-nvidia_gres]) Triggered 'refresh' from 1 event
Aug 23 12:39:15 nodegpu1.int.xxxxxx puppet-agent[28499]: (/Stage[main]/Profile::Slurm::Node/Service[slurmd]) Triggered 'refresh' from 1 event
Aug 23 12:39:15 nodegpu1.int.xxxxxx puppet-agent[28499]: Applied catalog in 11.11 seconds
I suspect this issue is caused by the naming of the nvidia-driver stream package.
We are working on a fix.
Should now be fixed.
When I try to create a compute instance with an NVIDIA GPU using Rocky 9.4, I get the following error:
So the package kmod-nvidia-latest-dkms, specified here https://github.com/ComputeCanada/puppet-magic_castle/blob/main/data/common.yaml#L297, was not found. I think this is because the default changed from latest-dkms to open-dkms:
The latest-dkms stream needs to be enabled in order to install kmod-nvidia-latest-dkms.
I see two solutions:
open-dkms
latest-dkms stream
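For illustration, enabling the latest-dkms stream before installing the package could look roughly like the Puppet sketch below. It assumes Puppet's built-in dnfmodule package provider and a DNF module named nvidia-driver. The resource title is borrowed from the log output in this thread, and the ordering is my own guess, not taken from the repository.

```puppet
# Hypothetical sketch, not the repository's actual manifest.
# Enable the latest-dkms stream of the nvidia-driver module without
# installing any module profile (enable_only => true).
package { 'nvidia-stream':
  ensure      => 'latest-dkms',
  name        => 'nvidia-driver',
  provider    => 'dnfmodule',
  enable_only => true,
}

# Install the kernel module package only once the stream is enabled,
# so that dnf can resolve kmod-nvidia-latest-dkms.
package { 'kmod-nvidia-latest-dkms':
  ensure  => 'installed',
  require => Package['nvidia-stream'],
}
```

The other option would presumably not need a stream resource at all, since open-dkms is reported above as the new default.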