ComputeCanada / puppet-magic_castle

Puppet Environment repo for Magic Castle - https://github.com/ComputeCanada/magic_castle
MIT License

GPU install broken on almalinux 8 #379

Closed: poquirion closed this issue 1 month ago

poquirion commented 2 months ago

Magic Castle: 13.5.0
Image: "AlmaLinux-8.9-x64-2023-11"

On a compute node with a GPU:

[centos@gpu-a100-401 ~]$ journalctl  -fu puppet
-- Logs begin at Wed 2024-09-11 12:41:37 UTC. --
Sep 11 12:52:47 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[1332]: (/Stage[main]/Prometheus::Node_exporter/Prometheus::Daemon[node_exporter]/Service[node_exporter]/ensure) ensure changed 'stopped' to 'running'
Sep 11 12:52:47 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[1332]: (/Stage[main]/Profile::Slurm::Node/Service[slurmd]) Skipping because of failed dependencies
Sep 11 12:52:47 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[1332]: (/Stage[main]/Profile::Slurm::Node/Exec[systemctl restart slurmd]) Skipping because of failed dependencies
Sep 11 12:52:47 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[1332]: Applied catalog in 640.01 seconds
Sep 11 13:04:07 gpu-a100-401.int.corbeau.dev-sd4h.ca systemd[1]: Stopping Puppet agent...
Sep 11 13:04:07 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[1284]: Caught TERM; exiting
Sep 11 13:04:07 gpu-a100-401.int.corbeau.dev-sd4h.ca systemd[1]: puppet.service: Succeeded.
Sep 11 13:04:07 gpu-a100-401.int.corbeau.dev-sd4h.ca systemd[1]: Stopped Puppet agent.
Sep 11 13:04:07 gpu-a100-401.int.corbeau.dev-sd4h.ca systemd[1]: Started Puppet agent.
Sep 11 13:04:08 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50043]: Starting Puppet client version 7.28.0
Sep 11 13:04:13 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: Requesting catalog from mgmt1:8140 (172.16.19.79)
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: Execution of '/usr/bin/dnf -d 0 -e 1 -y install kmod-nvidia-latest-dkms' returned 1: Error: Unable to find a match: kmod-nvidia-latest-dkms
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install::Passthrough/Package[kmod-nvidia-latest-dkms]/ensure) change from 'purged' to 'present' failed: Execution of '/usr/bin/dnf -d 0 -e 1 -y install kmod-nvidia-latest-dkms' returned 1: Error: Unable to find a match: kmod-nvidia-latest-dkms
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Config::Mig/Package[nvidia-mig-manager]) Dependency Package[kmod-nvidia-latest-dkms] has failures: true
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Config::Mig/Package[nvidia-mig-manager]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Config::Mig/Service[nvidia-mig-manager]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Config::Mig/File[/etc/nvidia-mig-manager/puppet-config.yaml]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Config::Mig/File_line[nvidia-persistenced.service]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Config::Mig/File[/etc/nvidia-mig-manager/puppet-hooks.yaml]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Config::Mig/Exec[nvidia-mig-parted apply]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install::Passthrough/Package[datacenter-gpu-manager]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install::Passthrough/File[/run/nvidia-persistenced]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install::Passthrough/Augeas[nvidia-persistenced.service]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Exec[dkms_nvidia]) Skipping because of failed dependencies
Sep 11 13:04:22 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Exec[nvidia-symlink]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia]/Exec[modprobe nvidia]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia]/File[/etc/modules-load.d/nvidia.conf]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia_drm]/Exec[modprobe nvidia_drm]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia_drm]/File[/etc/modules-load.d/nvidia_drm.conf]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia_modeset]/Exec[modprobe nvidia_modeset]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia_modeset]/File[/etc/modules-load.d/nvidia_modeset.conf]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia_uvm]/Exec[modprobe nvidia_uvm]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu::Install/Kmod::Load[nvidia_uvm]/File[/etc/modules-load.d/nvidia_uvm.conf]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu/Service[nvidia-persistenced]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Gpu/Service[nvidia-dcgm]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Slurm::Node/Exec[slurm-nvidia_gres]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Slurm::Node/Service[slurmd]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: (/Stage[main]/Profile::Slurm::Node/Exec[systemctl restart slurmd]) Skipping because of failed dependencies
Sep 11 13:04:23 gpu-a100-401.int.corbeau.dev-sd4h.ca puppet-agent[50044]: Applied catalog in 6.90 seconds
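
Most of the lines in this run are downstream skips; the only resource that actually fails is Package[kmod-nvidia-latest-dkms]. To make that easier to see in a long journal, here is a minimal Python sketch that separates root failures from skipped resources. The regular expression and the `summarize` helper are hypothetical, inferred only from the log excerpt above and not from Puppet documentation, and the hostname in the sample lines is shortened to `host`.

```python
import re

# Pattern inferred from the journal excerpt in this issue (an assumption,
# not an official Puppet log format): "puppet-agent[PID]: (RESOURCE) MESSAGE".
LINE_RE = re.compile(
    r"puppet-agent\[\d+\]: \((?P<resource>/Stage\[main\]/.+?)\) (?P<message>.*)$"
)


def summarize(lines):
    """Split puppet-agent log lines into failed and skipped resource paths."""
    failed, skipped = [], []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a per-resource line (e.g. "Applied catalog in ...")
        resource = match.group("resource")
        message = match.group("message")
        if message.startswith("Skipping because of failed dependencies"):
            skipped.append(resource)
        elif " failed: " in message:
            failed.append(resource)
    return failed, skipped


# Two lines copied from the log above, with the hostname shortened.
sample = [
    "Sep 11 13:04:22 host puppet-agent[50044]: "
    "(/Stage[main]/Profile::Gpu::Install::Passthrough/Package[kmod-nvidia-latest-dkms]/ensure) "
    "change from 'purged' to 'present' failed: Execution of "
    "'/usr/bin/dnf -d 0 -e 1 -y install kmod-nvidia-latest-dkms' returned 1: "
    "Error: Unable to find a match: kmod-nvidia-latest-dkms",
    "Sep 11 13:04:22 host puppet-agent[50044]: "
    "(/Stage[main]/Profile::Gpu::Config::Mig/Package[nvidia-mig-manager]) "
    "Skipping because of failed dependencies",
]

failed, skipped = summarize(sample)
print("failed:", failed)
print("skipped:", skipped)
```

The same function could be fed the saved output of `journalctl -u puppet`, one line per list element.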

If I look for the package I find one with a similar name:

[centos@gpu-a100-401 ~]$ dnf search kmod-nvidia
Repository 'ceph-stable' is missing name in configuration, using id.
Last metadata expiration check: 0:20:24 ago on Wed 11 Sep 2024 12:50:04 PM UTC.
========================================================================================================= Name Matched: kmod-nvidia ==========================================================================================================
kmod-nvidia-open-dkms.x86_64 : NVIDIA driver open kernel module flavor

Maybe kmod-nvidia-latest-dkms.x86_64 is on AlmaLinux 9?

poquirion commented 2 months ago

I get the same error on AlmaLinux 9 :(

$  journalctl   -fu puppet
[...]
Sep 11 14:06:50 gpu-a100-40-1.int.corbeau.dev-sd4h.ca puppet-agent[1077]: Execution of '/usr/bin/dnf -d 0 -e 1 -y install kmod-nvidia-latest-dkms' returned 1: Error: Unable to find a match: kmod-nvidia-latest-dkms
Sep 11 14:06:50 gpu-a100-40-1.int.corbeau.dev-sd4h.ca puppet-agent[1077]: (/Stage[main]/Profile::Gpu::Install::Passthrough/Package[kmod-nvidia-latest-dkms]/ensure) change from 'purged' to 'present' failed: Execution of '/usr/bin/dnf -d 0 -e 1 -y install kmod-nvidia-latest-dkms' returned 1: Error: Unable to find a match: kmod-nvidia-latest-dkms
[...]

poquirion commented 2 months ago

I get the same thing with Rocky-9.3-x64-2023-11. :cry:

mboisson commented 1 month ago

This looks like https://github.com/ComputeCanada/puppet-magic_castle/issues/374

poquirion commented 1 month ago

Got it fixed by cherry-picking the GPU update from https://github.com/ComputeCanada/puppet-magic_castle/releases/tag/14.0.0-beta