alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Use ubuntu-drivers to install Nvidia drivers #2089

Closed JimMadge closed 3 months ago

JimMadge commented 3 months ago

:white_check_mark: Checklist

:vertical_traffic_light: Depends on

:arrow_heading_up: Summary

:closed_umbrella: Related issues

Closes #2042

:microscope: Tests

github-actions[bot] commented 3 months ago

Coverage report

This PR does not seem to contain any modification to coverable code.

JimMadge commented 3 months ago

The package is available for jammy and noble (at the moment we only really support jammy).

craddm commented 3 months ago
Aug 06 11:03:23 shm-green-sre-moocow-vm-workspace-02 desired_state.sh[4946]: changed: [localhost] => (item={'name': 'pycharm-community', 'classic': True})
Aug 06 11:03:23 shm-green-sre-moocow-vm-workspace-02 desired_state.sh[4946]: TASK [Check for Nvidia drivers] ************************************************
Aug 06 11:03:24 shm-green-sre-moocow-vm-workspace-02 python3[37261]: ansible-ansible.builtin.apt Invoked with name=nvidia-utils state=present package=['nvidia-utils'] update_cache_retries=5 update_cache_retry_max_delay=12 cache_valid_time=0 purge=False force=False dpkg_options=force-confdef,force-confold autoremove=False autoclean=False only_upgrade=False force_apt_get=False allow_unauthenticated=False update_cache=None deb=None default_release=None install_recommends=None upgrade=None policy_rc_d=None
[ 1542.124465] desired_state.sh[4946]: fatal: [localhost]: FAILED! => {"cache_update_time": 1722941566, "cache_updated": false, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\"     --simulate install 'nvidia-utils'' failed: E: Package 'nvidia-utils' has no installation candidate\n", "rc": 100, "stderr": "E: Package 'nvidia-utils' has no installation candidate\n", "stderr_lines": ["E: Package 'nvidia-utils' has no installation candidate"], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nPackage nvidia-utils is a virtual package provided by:\n  nvidia-utils-550-server 550.90.07-0ubuntu0.22.04.1\n  nvidia-utils-550 550.90.07-0ubuntu0.22.04.1\n  nvidia-utils-545 545.29.06-0ubuntu0.22.04.2\n  nvidia-utils-535-server 535.183.01-0ubuntu0.22.04.1\n  nvidia-utils-535 535.183.01-0ubuntu0.22.04.1\n  nvidia-utils-470-server 470.256.02-0ubuntu0.22.04.1\n  nvidia-utils-470 470.256.02-0ubuntu0.22.04.1\n  nvidia-utils-450-server 450.248.02-0ubuntu0.22.04.1\n  nvidia-utils-418-server 418.226.00-0ubuntu5~0.22.04.1\n  nvidia-utils-390 390.157-0ubuntu0.22.04.2\n\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "Package nvidia-utils is a virtual package provided by:", "  nvidia-utils-550-server 550.90.07-0ubuntu0.22.04.1", "  nvidia-utils-550 550.90.07-0ubuntu0.22.04.1", "  nvidia-utils-545 545.29.06-0ubuntu0.22.04.2", "  nvidia-utils-535-server 535.183.01-0ubuntu0.22.04.1", "  nvidia-utils-535 535.183.01-0ubuntu0.22.04.1", "  nvidia-utils-470-server 470.256.02-0ubuntu0.22.04.1", "  nvidia-utils-470 470.256.02-0ubuntu0.22.04.1", "  nvidia-utils-450-server 450.248.02-0ubuntu0.22.04.1", "  nvidia-utils-418-server 418.226.00-0ubuntu5~0.22.04.1", "  nvidia-utils-390 390.157-0ubuntu0.22.04.2", ""]}
[ 1542.132494] desired_state.sh[4946]: PLAY RECAP *********************************************************************
[ 1542.138688] desired_state.sh[4946]: localhost                  : ok=5    changed=2    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
[ 1542.382324] desired_state.sh[4945]: /
Aug 06 11:03:25 shm-green-sre-moocow-vm-workspace-02 desired_state.sh[4946]: fatal: [localhost]: FAILED! => {"cache_update_time": 1722941566, "cache_updated": false, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\"     --simulate install 'nvidia-utils'' failed: E: Package 'nvidia-utils' has no installation candidate\n", "rc": 100, "stderr": "E: Package 'nvidia-utils' has no installation candidate\n", "stderr_lines": ["E: Package 'nvidia-utils' has no installation candidate"], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nPackage nvidia-utils is a virtual package provided by:\n  nvidia-utils-550-server 550.90.07-0ubuntu0.22.04.1\n  nvidia-utils-550 550.90.07-0ubuntu0.22.04.1\n  nvidia-utils-545 545.29.06-0ubuntu0.22.04.2\n  nvidia-utils-535-server 535.183.01-0ubuntu0.22.04.1\n  nvidia-utils-535 535.183.01-0ubuntu0.22.04.1\n  nvidia-utils-470-server 470.256.02-0ubuntu0.22.04.1\n  nvidia-utils-470 470.256.02-0ubuntu0.22.04.1\n  nvidia-utils-450-server 450.248.02-0ubuntu0.22.04.1\n  nvidia-utils-418-server 418.226.00-0ubuntu5~0.22.04.1\n  nvidia-utils-390 390.157-0ubuntu0.22.04.2\n\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "Package nvidia-utils is a virtual package provided by:", "  nvidia-utils-550-server 550.90.07-0ubuntu0.22.04.1", "  nvidia-utils-550 550.90.07-0ubuntu0.22.04.1", "  nvidia-utils-545 545.29.06-0ubuntu0.22.04.2", "  nvidia-utils-535-server 535.183.01-0ubuntu0.22.04.1", "  nvidia-utils-535 535.183.01-0ubuntu0.22.04.1", "  nvidia-utils-470-server 470.256.02-0ubuntu0.22.04.1", "  nvidia-utils-470 470.256.02-0ubuntu0.22.04.1", "  nvidia-utils-450-server 450.248.02-0ubuntu0.22.04.1", "  nvidia-utils-418-server 418.226.00-0ubuntu5~0.22.04.1", "  nvidia-utils-390 390.157-0ubuntu0.22.04.2", ""]}
jemrobinson commented 3 months ago

@craddm : I think ignoring this (expected) error makes the playbook run to completion.
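For reference, a minimal sketch of what ignoring the failure could look like. The task name and module arguments are taken from the log above; `check_mode`, `register`, `ignore_errors` and the variable name `nvidia_utils_check` are assumptions about how the check might be written, not the playbook's actual contents.

```yaml
# Sketch only: tolerate the expected failure of the driver check.
- name: Check for Nvidia drivers
  ansible.builtin.apt:
    name: nvidia-utils            # virtual package on jammy, so apt finds no install candidate
    state: present
  check_mode: true                # assumption: matches the `--simulate` seen in the log
  register: nvidia_utils_check    # hypothetical variable name
  ignore_errors: true             # let the play continue when the check fails
```

Later tasks could then branch on `nvidia_utils_check is failed` instead of the play stopping at this task.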

JimMadge commented 3 months ago

That should work, but it feels a bit hacky. An alternative could be to use ansible.builtin.stat.
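A rough sketch of the `ansible.builtin.stat` idea: test for a file that a driver install would leave behind, and only run the installer when it is missing. The path `/usr/bin/nvidia-smi`, the variable name `nvidia_smi` and the bare `ubuntu-drivers install` command are illustrative assumptions, not taken from the commit.

```yaml
# Sketch only: gate the install on the presence of a driver-provided binary.
- name: Check whether nvidia-smi is present
  ansible.builtin.stat:
    path: /usr/bin/nvidia-smi     # assumed location of a file installed with the driver utilities
  register: nvidia_smi            # hypothetical variable name

- name: Use ubuntu-drivers to install Nvidia drivers
  ansible.builtin.command:
    cmd: ubuntu-drivers install   # assumed invocation
  when: not nvidia_smi.stat.exists
```

Unlike the apt check, `stat` does not fail when the file is absent, so no error has to be ignored.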

craddm commented 3 months ago

Testing now.

JimMadge commented 3 months ago

Pushed a commit for the stat option, we can try both.

JimMadge commented 3 months ago

@craddm By the way, you don't need to redeploy to test. Just upload the next desired state stuff to the blob container.

JimMadge commented 3 months ago

Or... just use creates: 🤔
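As a sketch of the `creates:` variant, the separate check task disappears: `ansible.builtin.command` skips the command when the named path already exists. The path and the exact `ubuntu-drivers` invocation below are assumptions for illustration; the commit itself may differ.

```yaml
# Sketch only: a single task, skipped once the driver utilities are in place.
- name: Use ubuntu-drivers to install Nvidia drivers
  ansible.builtin.command:
    cmd: ubuntu-drivers install       # assumed invocation
    creates: /usr/bin/nvidia-smi      # assumed marker file; the task reports "ok" if it exists
```

This folds the `stat` check and the `when:` condition into one argument, at the cost of running the command every time the marker file is absent.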

JimMadge commented 3 months ago

Working at 9bab3f0

With nvidia-utils

PLAY [Desired state configuration] *********************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Use ubuntu-drivers to install Nvidia drivers] ****************************
ok: [localhost]

Without nvidia-utils

PLAY [Desired state configuration] *********************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Use ubuntu-drivers to install Nvidia drivers] ****************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0  
craddm commented 3 months ago

I was just trying to test your fix, @jemrobinson, but if @JimMadge's works then...

jemrobinson commented 3 months ago

@craddm : Mine did work (for me) but @JimMadge's made about 3 more versions since then :)

craddm commented 3 months ago

> @craddm By the way, you don't need to redeploy to test. Just upload the next desired state stuff to the blob container.

Once I've done that, how do I get the state to update?

JimMadge commented 3 months ago

@craddm You can start the desired-state service, run the /root/desired_state.sh script, or run ansible-playbook directly.

craddm commented 3 months ago

OK, well, James's fix worked for me too, but it sounds like yours is the way to go.

JimMadge commented 3 months ago

The current HEAD worked on my deployment (see here).

It isn't strictly idempotent: when there is no driver to install, the task runs every time and always reports a change. That might be something we want to address later.
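One possible way to address that later, sketched under assumptions: ask `ubuntu-drivers` which drivers apply first, and skip the install when it lists none. This assumes `ubuntu-drivers list` prints nothing to stdout on a machine with no supported GPU, which would need checking; the variable name and marker path are hypothetical.

```yaml
# Sketch only: avoid a spurious "changed" on machines with nothing to install.
- name: List applicable Nvidia drivers
  ansible.builtin.command:
    cmd: ubuntu-drivers list
  register: available_drivers         # hypothetical variable name
  changed_when: false                 # a read-only query never changes the system

- name: Use ubuntu-drivers to install Nvidia drivers
  ansible.builtin.command:
    cmd: ubuntu-drivers install       # assumed invocation
    creates: /usr/bin/nvidia-smi      # assumed marker file
  when: available_drivers.stdout | trim | length > 0
```

With this shape, a VM without a GPU would report the install task as skipped rather than changed on every run.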

JimMadge commented 3 months ago

@craddm Do you want to test and/or leave a review?