GoogleCloudPlatform / compute-gpu-installation

Apache License 2.0

installation fails on current default Debian (11) image on GCE #28

Closed sebbov closed 1 year ago

sebbov commented 1 year ago

Create a VM with the default Debian image, which currently resolves to debian-11-bullseye-v20230814. I attached a T4, but that shouldn't matter.

Running the install script on first boot results in:

subprocess.CalledProcessError: Command '['apt', 'install', '-y', 'linux-headers-5.10.0-24-cloud-amd64', 'software-properties-common', 'pciutils', 'gcc', 'make', 'dkms']' returned non-zero exit status 100.

Full output below.

First seen on 8/23. Internally reported at b/297284811. Not a Debian issue per: https://bugs.debian.org/1050841. #26 was a partial fix which got reverted.
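For readers who want to see where the failing command comes from: the traceback in the output shows the script formatting an `apt install` string with the running kernel release and passing it through `shlex.split` to `subprocess.run`. The following is only a minimal sketch of that step, not the script's actual code; the function name `build_dependency_command` is hypothetical.

```python
import shlex


def build_dependency_command(kernel_version: str) -> list:
    """Assemble the apt command for the headers of the given kernel release.

    Hypothetical helper: it mirrors the f-string visible in the traceback,
    but it is not copied from install_gpu_driver.py.
    """
    command = (
        f"apt install -y linux-headers-{kernel_version} "
        "software-properties-common pciutils gcc make dkms"
    )
    # The traceback shows shlex.split(command) being handed to subprocess.run,
    # which is why the error message prints the command as a list of tokens.
    return shlex.split(command)
```

Calling `build_dependency_command("5.10.0-24-cloud-amd64")` yields the same token list that appears in the `CalledProcessError` message, so the headers package name is tied to whatever `uname -r` reports at the time the script runs.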

seb@seb-nvidia-test1:~$ curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py                                           
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15702  100 15702    0     0   121k      0 --:--:-- --:--:-- --:--:--  122k
seb@seb-nvidia-test1:~$ sudo python3 install_gpu_driver.py
[2023-08-30 15:54:00] Executing: which nvidia-smi
[2023-08-30 15:54:00] Executing: uname -r
5.10.0-24-cloud-amd64
[2023-08-30 15:54:00] Executing: apt update
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Get:1 https://packages.cloud.google.com/apt google-compute-engine-bullseye-stable InRelease [5146 B]
Get:2 https://packages.cloud.google.com/apt cloud-sdk-bullseye InRelease [6406 B]
Hit:3 https://deb.debian.org/debian bullseye InRelease
Get:4 https://deb.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:5 https://packages.cloud.google.com/apt google-compute-engine-bullseye-stable/main amd64 Packages [1906 B]
Get:6 https://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:7 https://deb.debian.org/debian bullseye-backports InRelease [49.0 kB]
Get:8 https://packages.cloud.google.com/apt cloud-sdk-bullseye/main amd64 Packages [350 kB]
Get:9 https://deb.debian.org/debian-security bullseye-security/main Sources [153 kB]
Get:10 https://deb.debian.org/debian-security bullseye-security/main amd64 Packages [245 kB]
Get:11 https://deb.debian.org/debian-security bullseye-security/main Translation-en [158 kB]
Get:12 https://deb.debian.org/debian bullseye-updates/main Sources.diff/Index [20.7 kB]
Get:13 https://deb.debian.org/debian bullseye-updates/main amd64 Packages.diff/Index [20.7 kB]
Get:14 https://deb.debian.org/debian bullseye-updates/main Translation-en.diff/Index [9483 B]
Get:15 https://deb.debian.org/debian bullseye-updates/main Sources T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [520 B]
Get:16 https://deb.debian.org/debian bullseye-updates/main amd64 Packages T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [464 B]
Get:15 https://deb.debian.org/debian bullseye-updates/main Sources T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [520 B]
Get:16 https://deb.debian.org/debian bullseye-updates/main amd64 Packages T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [464 B]
Get:17 https://deb.debian.org/debian bullseye-updates/main Translation-en T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [199 B]
Get:17 https://deb.debian.org/debian bullseye-updates/main Translation-en T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [199 B]
Get:18 https://deb.debian.org/debian bullseye-backports/main Sources.diff/Index [63.3 kB]
Get:19 https://deb.debian.org/debian bullseye-backports/main amd64 Packages.diff/Index [63.3 kB]
Get:20 https://deb.debian.org/debian bullseye-backports/main Translation-en.diff/Index [63.3 kB]
Get:21 https://deb.debian.org/debian bullseye-backports/main Sources T-2023-08-30-1418.39-F-2023-08-15-2006.40.pdiff [22.1 kB]
Get:21 https://deb.debian.org/debian bullseye-backports/main Sources T-2023-08-30-1418.39-F-2023-08-15-2006.40.pdiff [22.1 kB]
Get:22 https://deb.debian.org/debian bullseye-backports/main amd64 Packages T-2023-08-30-1418.39-F-2023-08-16-0804.46.pdiff [7658 B]
Get:22 https://deb.debian.org/debian bullseye-backports/main amd64 Packages T-2023-08-30-1418.39-F-2023-08-16-0804.46.pdiff [7658 B]
Get:23 https://deb.debian.org/debian bullseye-backports/main Translation-en T-2023-08-30-1418.39-F-2023-08-30-1418.39.pdiff [2483 B]
Get:23 https://deb.debian.org/debian bullseye-backports/main Translation-en T-2023-08-30-1418.39-F-2023-08-30-1418.39.pdiff [2483 B]
Fetched 1335 kB in 1s (1143 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
3 packages can be upgraded. Run 'apt list --upgradable' to see them.

[2023-08-30 15:54:02] Executing: apt install -y linux-headers-5.10.0-24-cloud-amd64 software-properties-common pciutils gcc make dkms

Failed with exception: Command '['apt', 'install', '-y', 'linux-headers-5.10.0-24-cloud-amd64', 'software-properties-common', 'pciutils', 'gcc', 'make', 'dkms']' returned non-zero exit status 100.
Traceback (most recent call last):
  File "/home/seb/install_gpu_driver.py", line 444, in <module>
    raise err
  File "/home/seb/install_gpu_driver.py", line 441, in <module>
    main()
  File "/home/seb/install_gpu_driver.py", line 436, in main
    install(args)
  File "/home/seb/install_gpu_driver.py", line 414, in install
    install_dependencies(system, version)
  File "/home/seb/install_gpu_driver.py", line 327, in install_dependencies
    reboot_flag = install_dependencies_debian_ubuntu(system, version)
  File "/home/seb/install_gpu_driver.py", line 294, in install_dependencies_debian_ubuntu
    run(f"apt install -y linux-headers-{kernel_version} "
  File "/home/seb/install_gpu_driver.py", line 130, in run
    proc = subprocess.run(shlex.split(command), check=check,
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['apt', 'install', '-y', 'linux-headers-5.10.0-24-cloud-amd64', 'software-properties-common', 'pciutils', 'gcc', 'make', 'dkms']' returned non-zero exit status 100.

seb@seb-nvidia-test1:~$ cat /opt/google/gpu-installer/err.log 
[2023-08-30 15:54:00] Executing: which nvidia-smi

[2023-08-30 15:54:00] Executing: uname -r

[2023-08-30 15:54:00] Executing: apt update

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

[2023-08-30 15:54:02] Executing: apt install -y linux-headers-5.10.0-24-cloud-amd64 software-properties-common pciutils gcc make dkms

Failed with exception: Command '['apt', 'install', '-y', 'linux-headers-5.10.0-24-cloud-amd64', 'software-properties-common', 'pciutils', 'gcc', 'make', 'dkms']' returned non-zero exit status 100.
seb@seb-nvidia-test1:~$ cat /opt/google/gpu-installer/out.log 
[2023-08-30 15:54:00] Executing: which nvidia-smi

[2023-08-30 15:54:00] Executing: uname -r

5.10.0-24-cloud-amd64

[2023-08-30 15:54:00] Executing: apt update

Get:1 https://packages.cloud.google.com/apt google-compute-engine-bullseye-stable InRelease [5146 B]
Get:2 https://packages.cloud.google.com/apt cloud-sdk-bullseye InRelease [6406 B]
Hit:3 https://deb.debian.org/debian bullseye InRelease
Get:4 https://deb.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:5 https://packages.cloud.google.com/apt google-compute-engine-bullseye-stable/main amd64 Packages [1906 B]
Get:6 https://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:7 https://deb.debian.org/debian bullseye-backports InRelease [49.0 kB]
Get:8 https://packages.cloud.google.com/apt cloud-sdk-bullseye/main amd64 Packages [350 kB]
Get:9 https://deb.debian.org/debian-security bullseye-security/main Sources [153 kB]
Get:10 https://deb.debian.org/debian-security bullseye-security/main amd64 Packages [245 kB]
Get:11 https://deb.debian.org/debian-security bullseye-security/main Translation-en [158 kB]
Get:12 https://deb.debian.org/debian bullseye-updates/main Sources.diff/Index [20.7 kB]
Get:13 https://deb.debian.org/debian bullseye-updates/main amd64 Packages.diff/Index [20.7 kB]
Get:14 https://deb.debian.org/debian bullseye-updates/main Translation-en.diff/Index [9483 B]
Get:15 https://deb.debian.org/debian bullseye-updates/main Sources T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [520 B]
Get:16 https://deb.debian.org/debian bullseye-updates/main amd64 Packages T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [464 B]
Get:15 https://deb.debian.org/debian bullseye-updates/main Sources T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [520 B]
Get:16 https://deb.debian.org/debian bullseye-updates/main amd64 Packages T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [464 B]
Get:17 https://deb.debian.org/debian bullseye-updates/main Translation-en T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [199 B]
Get:17 https://deb.debian.org/debian bullseye-updates/main Translation-en T-2023-08-26-1408.20-F-2023-08-26-1408.20.pdiff [199 B]
Get:18 https://deb.debian.org/debian bullseye-backports/main Sources.diff/Index [63.3 kB]
Get:19 https://deb.debian.org/debian bullseye-backports/main amd64 Packages.diff/Index [63.3 kB]
Get:20 https://deb.debian.org/debian bullseye-backports/main Translation-en.diff/Index [63.3 kB]
Get:21 https://deb.debian.org/debian bullseye-backports/main Sources T-2023-08-30-1418.39-F-2023-08-15-2006.40.pdiff [22.1 kB]
Get:21 https://deb.debian.org/debian bullseye-backports/main Sources T-2023-08-30-1418.39-F-2023-08-15-2006.40.pdiff [22.1 kB]
Get:22 https://deb.debian.org/debian bullseye-backports/main amd64 Packages T-2023-08-30-1418.39-F-2023-08-16-0804.46.pdiff [7658 B]
Get:22 https://deb.debian.org/debian bullseye-backports/main amd64 Packages T-2023-08-30-1418.39-F-2023-08-16-0804.46.pdiff [7658 B]
Get:23 https://deb.debian.org/debian bullseye-backports/main Translation-en T-2023-08-30-1418.39-F-2023-08-30-1418.39.pdiff [2483 B]
Get:23 https://deb.debian.org/debian bullseye-backports/main Translation-en T-2023-08-30-1418.39-F-2023-08-30-1418.39.pdiff [2483 B]
Fetched 1335 kB in 1s (1143 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
3 packages can be upgraded. Run 'apt list --upgradable' to see them.

[2023-08-30 15:54:02] Executing: apt install -y linux-headers-5.10.0-24-cloud-amd64 software-properties-common pciutils gcc make dkms

m-strzelczyk commented 1 year ago

It seems I'll have to make the script reboot the system when it detects that a kernel update is pending. Once the kernel is updated and the headers are installed, the installation works fine. The script already reboots on some operating systems, so it shouldn't be that much of a deal. It just makes the whole process slower.

I'll probably have an update ready tomorrow.
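The decision described above could be sketched roughly as follows. This is an illustration under assumptions, not the change that went into the script: the function names, the returned labels, and the idea of passing in a list of kernel releases that have header packages available are all hypothetical, and the comparison ignores the flavour suffix (such as `cloud-amd64`).

```python
import re


def release_key(release: str) -> tuple:
    """Turn a release such as '5.10.0-24-cloud-amd64' into (5, 10, 0, 24)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def plan_header_install(running: str, available_headers: list) -> str:
    """Decide what to do before installing the driver (hypothetical labels).

    running           -- output of `uname -r`
    available_headers -- kernel releases for which a linux-headers package
                         can currently be installed (assumed to be known)
    """
    if running in available_headers:
        # Headers match the running kernel, so no reboot is needed.
        return "install-headers"
    if any(release_key(r) > release_key(running) for r in available_headers):
        # Only a newer kernel has headers: upgrade the kernel, reboot into it,
        # then install the matching headers.
        return "upgrade-kernel-and-reboot"
    return "fail"
```

With a running kernel of `5.10.0-24-cloud-amd64` and headers available only for a newer release, the sketch returns `"upgrade-kernel-and-reboot"`, which is the slower path mentioned above.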

wenyhu-google commented 1 year ago

Hi @m-strzelczyk, thanks for the quick fix. However, this still blocks my team because we do not support a reboot during GPU driver installation.

Do you happen to know whether there is a good way to let Debian update the kernel without rebooting? Much appreciated!

m-strzelczyk commented 1 year ago

I'm afraid it's impossible to update the kernel without restarting the system.

This script has always had an option to reboot systems for driver installation. It didn't happen before on Debian or Ubuntu because it wasn't necessary. Unfortunately it is now :< Perhaps once GCP releases a new base boot image that has the newest kernel, the extra reboot won't be needed any more.
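Since a newly installed kernel only takes effect after a restart, one rough way to tell whether a VM is still waiting for that restart is to compare the running release with the newest kernel image under `/boot`. The sketch below assumes the Debian convention of `vmlinuz-<release>` files in `/boot`; the function names are hypothetical and the comparison ignores the flavour suffix.

```python
import os
import re

RELEASE_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)-(\d+)")


def release_key(release: str) -> tuple:
    """Numeric sort key, e.g. '5.10.0-24-cloud-amd64' -> (5, 10, 0, 24)."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def newest_installed_kernel(boot_dir: str = "/boot"):
    """Return the highest release that has a vmlinuz-<release> file, or None."""
    prefix = "vmlinuz-"
    releases = [
        name[len(prefix):]
        for name in os.listdir(boot_dir)
        if name.startswith(prefix) and RELEASE_RE.match(name[len(prefix):])
    ]
    return max(releases, key=release_key, default=None)


def reboot_pending(running: str, boot_dir: str = "/boot") -> bool:
    """True if a newer kernel is installed than the one currently running."""
    newest = newest_installed_kernel(boot_dir)
    return newest is not None and release_key(newest) > release_key(running)
```

On a real system `running` would come from `uname -r` (or `os.uname().release`). A base image that already ships the newest kernel would make `reboot_pending` return `False` on first boot.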