Open jabl opened 6 years ago
ELRepo has previously released, for example, the nvidia 340xx driver, which once installed stays on a compatible version for the rest of the system's life. For some reason the 390xx driver has not been released: http://elrepo.org/linux/elrepo/el7/x86_64/RPMS/
The ELRepo 390xx driver is in elrepo-testing, but I don't think it is compatible here.
One possibility is to erase the nvidia stuff:
    yum erase cuda-drivers xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel xorg-x11-drv-nvidia-gl xorg-x11-drv-nvidia-libs cuda nvidia-kmod
Then install the 390.30 / CUDA 9.1 packages:
    yum install cuda-drivers-390.30-1.x86_64 xorg-x11-drv-nvidia-390.30-1.el7.x86_64 xorg-x11-drv-nvidia-devel-390.30-1.el7.x86_64 xorg-x11-drv-nvidia-gl-390.30-1.el7.x86_64 xorg-x11-drv-nvidia-libs-390.30-1.el7.x86_64 cuda-9.1.85-1.x86_64 cuda-9-1-9.1.85-1.x86_64 cuda-demo-suite-9-1-9.1.85-1.x86_64 cuda-runtime-9-1-9.1.85-1.x86_64 nvidia-kmod-390.30
Then install the versionlock plugin:
    yum install yum-plugin-versionlock
And lock the packages:
    yum versionlock cuda-drivers xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel xorg-x11-drv-nvidia-gl xorg-x11-drv-nvidia-libs cuda nvidia-kmod
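For anyone wanting to try the lock step without touching the system first, here is a rough sketch with a dry-run switch. The `RUN` variable and the `pin_nvidia` function are my own hypothetical names, not part of yum; the sketch uses the explicit `yum versionlock add` form, which as far as I understand locks the versions currently installed, so it assumes the 390.30 packages are already in place.

```shell
#!/bin/sh
# Packages named in the steps above.
NVIDIA_PKGS="cuda-drivers xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel xorg-x11-drv-nvidia-gl xorg-x11-drv-nvidia-libs cuda nvidia-kmod"

# RUN is a hypothetical dry-run switch (not a yum feature): it defaults to
# "echo", so the commands are only printed. Set RUN= (empty) to execute them.
RUN=${RUN-echo}

pin_nvidia() {
    $RUN yum -y install yum-plugin-versionlock
    # Intentionally unquoted so the list splits into separate arguments.
    $RUN yum versionlock add $NVIDIA_PKGS
    # Print the resulting locks for a quick sanity check.
    $RUN yum versionlock list
}
```

Calling `pin_nvidia` as-is only prints the three commands; running it with `RUN=` set to the empty string as root would actually execute them.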
Yeah, in the end what we did was to put the following in the group_vars for the affected nodes:
kickstart_extra_post_commands: |
  ...
  # for older systems with an NVIDIA card, pin the cuda version to 9.1
  yum -y install yum-plugin-versionlock libibverbs
  echo "1:nvidia-kmod-390.30-2.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-libs-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-devel-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-gl-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "0:cuda-drivers-390.30-1.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "0:cuda-9.1.85-1.*" >> /etc/yum/pluginconf.d/versionlock.list
  # install the kmod now so /dev/nvidia0 is found and no extra reboot is needed later
  if lspci | grep -Eq '(M2090|M2070)'; then rpm -ivh http://10.10.254.20/nvidia-kmod-390.30-2.el7.x86_64.rpm; fi
Kludgy maybe, but got the job done.
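If the post commands ever get re-run on a node, the plain `echo ... >>` lines would append duplicate entries. A slightly less kludgy variant might look like the sketch below; `add_lock` and `lock_nvidia_390` are hypothetical helper names of mine, and the lock-file argument is only there so the functions can be tried against a scratch file. The entries themselves are copied from the snippet above.

```shell
#!/bin/sh
# Append one entry to a versionlock list unless that exact line is already present.
# $1 = lock file path, $2 = entry
add_lock() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Lock the 390.30 / CUDA 9.1 package set.
# $1 = lock file (hypothetical parameter; defaults to the plugin's list file).
lock_nvidia_390() {
    lockfile=${1:-/etc/yum/pluginconf.d/versionlock.list}
    for entry in \
        "1:nvidia-kmod-390.30-2.el7.*" \
        "1:xorg-x11-drv-nvidia-390.30-1.el7.*" \
        "1:xorg-x11-drv-nvidia-libs-390.30-1.el7.*" \
        "1:xorg-x11-drv-nvidia-devel-390.30-1.el7.*" \
        "1:xorg-x11-drv-nvidia-gl-390.30-1.el7.*" \
        "0:cuda-drivers-390.30-1.*" \
        "0:cuda-9.1.85-1.*"
    do
        add_lock "$lockfile" "$entry"
    done
}
```

Calling `lock_nvidia_390` with no argument writes to the real list file; passing a temporary path lets you inspect the result first.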
On some of our (FGI-era) GPU nodes dmesg says:
We need to figure out how to support these nodes; perhaps pinning an older version of nvidia-kmod is enough?