fgci-org / ansible-role-cuda

Installs cuda
MIT License

Need to support nvidia legacy driver #16

Open jabl opened 6 years ago

jabl commented 6 years ago

On some of our (FGI-era) GPU nodes dmesg says:

[15280.304871] NVRM: The NVIDIA Tesla M2070 GPU installed in this system is
NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
NVRM:  visit http://www.nvidia.com/object/unix.html for more
NVRM:  information.  The 396.26 NVIDIA driver will ignore
NVRM:  this GPU.  Continuing probe...

We need to figure out how to support these nodes. Perhaps pinning an older version of nvidia-kmod is enough?
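To find which nodes are affected, the kernel message above can be scanned for the legacy branch it names. The following is a minimal sketch, assuming the message wording matches the dmesg excerpt quoted here; `legacy_branch` is a hypothetical helper name and not part of this role.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the legacy
# driver branch named by NVRM (for example "390.xx"). Prints nothing when
# no such message is present. Assumes the wording shown in the dmesg excerpt.
legacy_branch() {
    sed -n 's/.*supported through the NVIDIA \([0-9][0-9]*\.xx\) Legacy drivers.*/\1/p' | head -n 1
}

# Usage on a node:
#   dmesg | legacy_branch
```

An empty result means the running driver did not report the GPU as legacy-only, so the node probably needs no special handling.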

VilleS1 commented 6 years ago

elrepo has previously released, for example, an nvidia 340xx driver which, once installed, stays at a compatible version for the rest of the system's life. For some reason the 390xx driver has not been released. http://elrepo.org/linux/elrepo/el7/x86_64/RPMS/

VilleS1 commented 6 years ago

The elrepo 390xx driver is in elrepo-testing, but I don't think it is compatible here.

One possibility is to erase the nvidia stuff:

  yum erase cuda-drivers xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel xorg-x11-drv-nvidia-gl xorg-x11-drv-nvidia-libs cuda nvidia-kmod

Then install:

  yum install cuda-drivers-390.30-1.x86_64 \
    xorg-x11-drv-nvidia-390.30-1.el7.x86_64 \
    xorg-x11-drv-nvidia-devel-390.30-1.el7.x86_64 \
    xorg-x11-drv-nvidia-gl-390.30-1.el7.x86_64 \
    xorg-x11-drv-nvidia-libs-390.30-1.el7.x86_64 \
    cuda-9.1.85-1.x86_64 cuda-9-1-9.1.85-1.x86_64 \
    cuda-demo-suite-9-1-9.1.85-1.x86_64 cuda-runtime-9-1-9.1.85-1.x86_64 \
    nvidia-kmod-390.30

Then install the versionlock plugin:

  yum install yum-plugin-versionlock

And lock it:

  yum versionlock cuda-drivers xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel xorg-x11-drv-nvidia-gl xorg-x11-drv-nvidia-libs cuda nvidia-kmod
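The four steps above could be collected into one script. This is only a sketch: the package names and versions are copied from this comment, while the `run` wrapper, the `pin_legacy_driver` name, the `DRY_RUN` switch, and the added `-y` flag are assumptions made for illustration. It prints the commands by default so they can be reviewed before anything is removed.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them (needs root and the CUDA repo enabled).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Unversioned package names, used for both the erase and the lock step.
PKGS="cuda-drivers xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel xorg-x11-drv-nvidia-gl xorg-x11-drv-nvidia-libs cuda nvidia-kmod"

pin_legacy_driver() {
    # $PKGS is left unquoted on purpose so it splits into one argument per package.
    run yum -y erase $PKGS
    run yum -y install \
        cuda-drivers-390.30-1.x86_64 \
        xorg-x11-drv-nvidia-390.30-1.el7.x86_64 \
        xorg-x11-drv-nvidia-devel-390.30-1.el7.x86_64 \
        xorg-x11-drv-nvidia-gl-390.30-1.el7.x86_64 \
        xorg-x11-drv-nvidia-libs-390.30-1.el7.x86_64 \
        cuda-9.1.85-1.x86_64 cuda-9-1-9.1.85-1.x86_64 \
        cuda-demo-suite-9-1-9.1.85-1.x86_64 cuda-runtime-9-1-9.1.85-1.x86_64 \
        nvidia-kmod-390.30
    run yum -y install yum-plugin-versionlock
    run yum versionlock $PKGS
}

# Usage: call pin_legacy_driver to print the plan; export DRY_RUN=0 first to apply it.
```

Locking only after the install succeeds matters here: the lock then captures the 390.30 versions that are actually on disk rather than whatever was installed before.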

jabl commented 6 years ago

Yeah, in the end what we did was to put the following in the group_vars for the affected nodes:

kickstart_extra_post_commands: |
  ...
  # for older systems with a legacy NVIDIA card, pin the cuda version to 9.1
  yum -y install yum-plugin-versionlock libibverbs
  echo "1:nvidia-kmod-390.30-2.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-libs-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-devel-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "1:xorg-x11-drv-nvidia-gl-390.30-1.el7.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "0:cuda-drivers-390.30-1.*" >> /etc/yum/pluginconf.d/versionlock.list
  echo "0:cuda-9.1.85-1.*" >> /etc/yum/pluginconf.d/versionlock.list
  # install the kmod now so /dev/nvidia0 is found without an extra reboot later
  if lspci | grep -Eq 'M2090|M2070'; then rpm -ivh http://10.10.254.20/nvidia-kmod-390.30-2.el7.x86_64.rpm; fi

Kludgy maybe, but got the job done.
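One caveat with appending via `>>`: if the post commands ever run twice on the same node, the lock list gets duplicate entries. A possible variant, sketched below, adds each entry only when it is not already present. The entries are the ones from the snippet above; `add_lock` and `lock_legacy_packages` are hypothetical names, and the lock file path is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: append entry $2 to lock file $1 unless an identical
# line is already there. -x matches whole lines, -F treats the entry as a
# literal string, so the ".*" suffix is not read as a regex.
add_lock() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Add the 390.30 / CUDA 9.1 entries from the kickstart snippet to file $1.
lock_legacy_packages() {
    for entry in \
        "1:nvidia-kmod-390.30-2.el7.*" \
        "1:xorg-x11-drv-nvidia-390.30-1.el7.*" \
        "1:xorg-x11-drv-nvidia-libs-390.30-1.el7.*" \
        "1:xorg-x11-drv-nvidia-devel-390.30-1.el7.*" \
        "1:xorg-x11-drv-nvidia-gl-390.30-1.el7.*" \
        "0:cuda-drivers-390.30-1.*" \
        "0:cuda-9.1.85-1.*"
    do
        add_lock "$1" "$entry"
    done
}

# Usage in the post commands:
#   lock_legacy_packages /etc/yum/pluginconf.d/versionlock.list
```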