NVIDIA / ansible-role-nvidia-driver

BSD 3-Clause "New" or "Revised" License

install from cuda repo fails #60

Closed KasperSkytte closed 2 years ago

KasperSkytte commented 2 years ago

When setting nvidia_driver_ubuntu_install_from_cuda_repo: yes I get:

TASK [nvidia.nvidia_driver : install driver packages] ****************************************************
fatal: [nodename]: FAILED! => {"changed": false, "msg": "No package matching 'cuda-drivers-510' is available"}

The package doesn't exist:

$ apt-cache search ^cuda-drivers
cuda-drivers-fabricmanager-450 - Meta-package for FM and Driver
cuda-drivers-fabricmanager-460 - Transitional package for cuda-drivers-fabricmanager-510
cuda-drivers-fabricmanager-470 - Meta-package for FM and Driver
cuda-drivers-fabricmanager-510 - Meta-package for FM and Driver

No issues when setting nvidia_driver_ubuntu_install_from_cuda_repo: no, but I would like to install from the CUDA repository.
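
A quick way to check whether the target node can see the package before running the role is to look at the candidate version apt reports. The following is a minimal sketch and not part of the role: the helper name `has_candidate` is made up for illustration, and it only inspects text in the format that `apt-cache policy` prints.

```shell
# Hypothetical helper (not part of the role): reads `apt-cache policy <pkg>`
# output on stdin and succeeds only if apt reports an installable candidate.
has_candidate() {
    awk '
        $1 == "Candidate:" { found = 1; if ($2 == "(none)") none = 1 }
        END { if (found && !none) exit 0; exit 1 }
    '
}

# Usage on the target node (assumes apt-cache is available there):
#   apt-cache policy cuda-drivers-510 | has_candidate && echo "installable"
```

The helper also fails when the input is empty, which covers the case where apt prints no policy block for the package.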

KasperSkytte commented 2 years ago

I'm using the newest version of the Ansible role as of an hour or so ago. The target node runs Ubuntu 20.04 LTS.

BarthV commented 2 years ago

Same problem here.

ajdecon commented 2 years ago

I can confirm I'm seeing the same thing. The driver role code hasn't changed since the last successful test, but this looks like an issue in the upstream repo. I'll check with the repo maintainers.

ajdecon commented 2 years ago

Confirmed there was an issue with the upstream apt package repository, and this should now be fixed.

Tested via triggering a CI run and confirming that all install paths are successful: https://github.com/NVIDIA/ansible-role-nvidia-driver/runs/6314471044

Please make sure to run apt-get update before attempting a new install. Closing this based on the successful test, but feel free to re-open this issue if the problem persists.

KasperSkytte commented 2 years ago

Thanks for the quick response, but I still get the same error, even after reinstalling the role and updating package info with sudo apt-get update. Fresh Ubuntu focal VM, everything default.

KasperSkytte commented 2 years ago

I have no permission to re-open the issue

ajdecon commented 2 years ago

Oops! I'm re-opening the issue myself and will kick off another test on my end (both through CI and on a local VM).

ajdecon commented 2 years ago

@KasperSkytte : Both the CI tests and my local VM tests are successfully installing the driver from the CUDA repos. This worked on all of Ubuntu 18.04, Ubuntu 20.04, and CentOS 7.

Can you confirm if you are still seeing this issue?

If so -- can you please do the following to help troubleshoot?

  1. On a fresh VM, run the role and provide a gist with your full log
  2. On a VM with the updated repos installed, please run sudo apt-get update and sudo apt-get install cuda-drivers-510 manually, and provide a gist with the full log

BarthV commented 2 years ago

On my side things are settling down. The driver and Docker Ansible roles are now doing their jobs on a bare new setup, and everything installs successfully from the CUDA repo.

For example, on a machine with 460 drivers:

$ apt-cache policy nvidia-driver-460 
nvidia-driver-460:
  Installed: 460.106.00-0ubuntu1
  Candidate: 460.106.00-0ubuntu1
  Version table:
     470.103.01-0ubuntu0.20.04.1 500
        500 http://fr.archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages
        500 http://fr.archive.ubuntu.com/ubuntu focal-security/restricted amd64 Packages
 *** 460.106.00-0ubuntu1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        100 /var/lib/dpkg/status
     460.91.03-0ubuntu1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     460.73.01-0ubuntu1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     460.32.03-0ubuntu1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     460.27.04-0ubuntu1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
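
To see at a glance which repositories are offering a package, the version table above can be reduced to its distinct repository URLs. This is only a sketch: the helper name `policy_origins` is hypothetical, and it assumes input laid out the way `apt-cache policy` prints it (a numeric priority followed by a URL).

```shell
# Hypothetical helper: reads `apt-cache policy <pkg>` output on stdin and
# prints each distinct http(s) repository URL found in the version table.
# Local entries such as /var/lib/dpkg/status are skipped.
policy_origins() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^https?:\/\// { print $2 }' | sort -u
}

# Usage on the target node (assumes apt-cache is available there):
#   apt-cache policy nvidia-driver-460 | policy_origins
```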

I think the main problem is dealing with machines where the previous version of this role was running. In this case I'm afraid:

These cases should be handled more carefully: the role should at least raise a warning or error. The best approach may be a "cleaning" task that makes sure all old parts of this role, nv repos, and nv apt keys are truly cleaned up beforehand.
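
As a starting point for such a cleanup, one could first list which apt configuration files still reference the NVIDIA repository. This is a sketch under assumptions, not what the role does: the helper name `find_nvidia_sources` is hypothetical, the host name is taken from the `apt-cache policy` output above, and what exactly the old setup left behind is not specified in this thread.

```shell
# Hypothetical helper: prints every file under the given apt configuration
# directory that mentions the NVIDIA download host. It only lists files and
# deletes nothing. Defaults to /etc/apt when no directory is given.
find_nvidia_sources() {
    dir="${1:-/etc/apt}"
    grep -rl 'developer\.download\.nvidia\.com' "$dir" 2>/dev/null | sort
}

# Usage on the target node:
#   find_nvidia_sources            # scans /etc/apt
#   find_nvidia_sources /etc/apt/sources.list.d
```

Listing before deleting keeps the step safe to run on machines where it is unclear which files came from the old setup.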

BarthV commented 2 years ago

I can help by giving the system state and files created by the "old" setup method.

KasperSkytte commented 2 years ago

@ajdecon Thank you for being so persistent. I got it working now too. Turns out my "fresh" VM wasn't so fresh; I was confusing myself with multiple VMs. It works out of the box now on Ubuntu 20.04. I'm closing again.

KasperSkytte commented 2 years ago

I agree with @BarthV that a few tasks to clean up after any other method(s) of installing the NVIDIA driver would be handy.

KurtAhn commented 2 years ago

I was also struggling to install the driver with the Ansible role and faced this exact same issue. So, I followed the installation guide. Because I wasn't able to install the cuda-keyring package, I followed the alternative steps described in 3.8.3.2, and this allowed me to install the driver. For good measure, I re-ran Ansible after this, and the driver installed successfully. I hope this is useful to someone.