NVIDIA / ansible-role-nvidia-driver

BSD 3-Clause "New" or "Revised" License
117 stars 67 forks

Support for Debian #67

Open fuog opened 1 year ago

fuog commented 1 year ago

Hi there,

I wanted to thank you for this nice Ansible role. Unfortunately, Debian does not seem to be officially supported, but I got it working with a bit of variable overriding. I would be happy to see Debian officially supported. Until then, maybe this will help someone on Debian use this role anyway.

# my playbook
.....
  roles:
    - role: unix-basics
      tags: unix-basics
    - role: xanmanning.k3s
      tags: k3s
    - role: nvidia.nvidia_driver  # should run after cluster install
      vars:
        # See https://github.com/NVIDIA/ansible-role-nvidia-driver#role-variables
        nvidia_driver_ubuntu_cuda_repo_baseurl: 'https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64'  # enforced 'debian11'
        nvidia_driver_ubuntu_install_from_cuda_repo: yes
        nvidia_driver_persistence_mode_on: yes
        ansible_distribution: Ubuntu  # force the role down its Ubuntu code path
      when: ansible_hostname == 'k3s-worker1'  # we only have ONE node with NVIDIA
      tags:
        - nvidia
....

Uzurka commented 1 year ago

+1, Debian support would be a great thing :D

Uzurka commented 1 year ago

Hey Fuog, I tried your "bypass" today and encountered this error:


redirecting (type: modules) ansible.builtin.kernel_blacklist to community.general.kernel_blacklist
redirecting (type: modules) community.general.kernel_blacklist to community.general.system.kernel_blacklist
redirecting (type: modules) ansible.builtin.kernel_blacklist to community.general.kernel_blacklist
redirecting (type: modules) community.general.kernel_blacklist to community.general.system.kernel_blacklist
Using module file /home/ludo/.local/lib/python3.8/site-packages/ansible_collections/community/general/plugins/modules/system/kernel_blacklist.py
Pipelining is enabled.
<192.168.1.2> ESTABLISH SSH CONNECTION FOR USER: root
<192.168.1.2> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/ludo/ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o 'ControlPath="/home/ludo/.ansible/cp/f1b6b591d3"' 192.168.1.2 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<192.168.1.2> (1, b'\n{"path": "/tmp/tmpla68agn1", "details": "Error while setting attributes: /tmp/tmpla68agn1: Operation not supported\\n", "exception": "Traceback (most recent call last):\\n  File \\"/tmp/ansible_kernel_blacklist_payload_lwuic3yg/ansible_kernel_blacklist_payload.zip/ansible/module_utils/basic.py\\", line 1003, in set_attributes_if_different\\n    raise Exception(\\"Error while setting attributes: %s\\" % (out + err))\\nException: Error while setting attributes: /tmp/tmpla68agn1: Operation not supported\\n\\n", "failed": true, "msg": "chattr failed", "uid": 0, "gid": 0, "owner": "root", "group": "root", "mode": "0644", "state": "file", "size": 0, "invocation": {"module_args": {"name": "nouveau", "state": "present", "blacklist_file": "/etc/modprobe.d/blacklist-ansible.conf"}}}\n', b'')
<192.168.1.2> Failed to connect to the host via ssh: 
The full traceback is:
Traceback (most recent call last):
  File "/tmp/ansible_kernel_blacklist_payload_lwuic3yg/ansible_kernel_blacklist_payload.zip/ansible/module_utils/basic.py", line 1003, in set_attributes_if_different
    raise Exception("Error while setting attributes: %s" % (out + err))
Exception: Error while setting attributes: /tmp/tmpla68agn1: Operation not supported

fatal: [openmediavault]: FAILED! => {
    "changed": false,
    "details": "Error while setting attributes: /tmp/tmpla68agn1: Operation not supported\n",
    "gid": 0,
    "group": "root",
    "invocation": {
        "module_args": {
            "blacklist_file": "/etc/modprobe.d/blacklist-ansible.conf",
            "name": "nouveau",
            "state": "present"
        }
    },
    "mode": "0644",
    "msg": "chattr failed",
    "owner": "root",
    "path": "/tmp/tmpla68agn1",
    "size": 0,
    "state": "file",
    "uid": 0
}

Any idea why?

Zorlin commented 1 year ago

> Hey Fuog, I tried your "bypass" today and encountered this error:
>
> Any idea why?

Try installing the acl package.
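
For anyone who wants to try that suggestion from within the playbook, a minimal sketch of a task that installs the package on the target host could look like the one below. The task name is made up for illustration, and the thread does not confirm that this resolves the "chattr failed" error, so treat it as something to test rather than a known fix.

```yaml
# Sketch only: run this before the nvidia_driver role, e.g. under pre_tasks.
- name: Install the acl package (suggested as a possible fix)
  ansible.builtin.apt:
    name: acl
    state: present
    update_cache: true
```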

Uzurka commented 1 year ago

I've given up on using this role until NVIDIA updates it to support Debian, which is one of the most widely used server distributions. In the meantime, I install and update my driver with these tasks:

    - name: Add contrib & non-free repository
      replace:
        dest: /etc/apt/sources.list
        regexp: '^(deb(?!.* contrib).*)'
        replace: '\1 contrib non-free'
      notify: Apt cache update
      tags: nvidia

    - name: Install the NVIDIA drivers
      apt:
        name: nvidia-driver
        autoremove: false
        dpkg_options: 'force-confnew'
      environment:
        DEBIAN_FRONTEND: noninteractive
      tags: nvidia

    - name: Install dependencies
      apt:
        update_cache: true
        name:
          - gnupg
          - build-essential
          - dirmngr
          - mariadb-server
          - docker-compose
          - docker-compose-plugin
          - python3-pymysql
          - nvidia-smi
          - nvidia-container-toolkit
          - nvidia-container-runtime
          - nvidia-docker2
      tags: dependancies 

The dependencies list contains nearly everything I need for my server, including the nvidia-docker and nvidia-container packages. Everything works fine with it.
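
The first task above notifies a handler named "Apt cache update" that is not shown in the comment. A minimal sketch of what such a handler might look like follows; its body is an assumption, and only the name comes from the `notify:` line. Note that handlers normally run at the end of the play, so the `flush_handlers` step (also an addition here, not part of the original tasks) is one way to refresh the cache before the driver install runs.

```yaml
# Play-level fragment. Hypothetical handler: its name must match the
# `notify:` string in the repository task exactly.
handlers:
  - name: Apt cache update
    ansible.builtin.apt:
      update_cache: true

tasks:
  # ... the "Add contrib & non-free repository" task goes here ...

  # Runs any notified handlers immediately instead of at the end of the play.
  - name: Run pending handlers now
    ansible.builtin.meta: flush_handlers

  # ... the driver install task follows ...
```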

jpellman commented 1 year ago

Just as a caveat for those trying to use @fuog's solution:

While install-redhat.yml has a task to install Linux kernel headers, install-ubuntu.yml does not have an equivalent task. If you want the NVIDIA driver to install and compile properly under Debian, you will also need to install linux-headers-{{ ansible_kernel }} in a separate task somewhere.
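
As a rough sketch of that separate task, something like the following could run before the role, for example under `pre_tasks`. The task name is illustrative, and it assumes fact gathering is enabled so that `ansible_kernel` is populated, and that a headers package matching the running kernel is available from the configured apt sources.

```yaml
# Sketch only: installs headers for the kernel the host is currently running.
- name: Install kernel headers matching the running kernel
  ansible.builtin.apt:
    name: "linux-headers-{{ ansible_kernel }}"
    state: present
    update_cache: true
```

If the host has a pending kernel upgrade, the running kernel and the newest installed kernel can differ, so it may be worth rebooting first so the headers match the kernel the driver is built against.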