GoogleCloudPlatform / compute-gpu-installation

Apache License 2.0

Not found drivers on instance restart #10

Closed JulesBelveze closed 2 years ago

JulesBelveze commented 2 years ago

Describe the bug

I am working with an A100 and have installed the NVIDIA driver using your script install_gpu_driver.py. After installation everything works smoothly.

However, I see strange behaviour when I stop the instance and restart it. The error message is the following:

>>> nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

If I try to re-run the install_gpu_driver.py script, it says that the drivers are already installed.

I have googled around and some people suggest waiting a moment before connecting to the instance, but that didn't solve my problem.

It has happened to me quite a few times, and I've tried different OSes (Ubuntu and Debian), but this seems to be a recurring problem. Any idea why this occurs?

Environment

m-strzelczyk commented 2 years ago

Hi!

Thanks for reporting this problem. It seems that it's related more to the driver itself than the installation script that we have here, but I'll try to help anyway :)

Just to make sure, the Debian and Ubuntu images you use are the default public ones available in GCP, right?

Since this is an A100 instance, there's only one machine type you can pick, a2-highgpu-1g, but can you tell me which zone you are running your instances in? I just want to be able to replicate the issue as closely as possible.

Is there anything else that's not default about your instance? Did you use Secure Boot?

m-strzelczyk commented 2 years ago

Also, could you provide your dmesg or at least dmesg | grep -i nvidia?

JulesBelveze commented 2 years ago

Hi @m-strzelczyk thanks for your help!

Just to make sure, the Debian and Ubuntu images you use are the default public ones available in GCP, right?

Yes I am using the default images

Since this is A100 instance, there's only one machine type you can pick a2-highgpu-1g - but can you tell me in which zone are you running your instances?

The machine type is indeed a2-highgpu-1g and I'm running my instances in europe-west4-a

Is there anything else that's not default about your instance? Did you use Secure Boot?

The instances I'm using are preemptible (dunno if this can help)

m-strzelczyk commented 2 years ago

@JulesBelveze I was not able to replicate your issue by creating a preemptible A100-equipped machine in us-central1. I'll try again in europe-west4-a like you did.

In the meantime, could you please share your dmesg output and lsmod output? This should tell us if the NVIDIA drivers are loading at all.
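
The "are the drivers loading at all" check can also be scripted. The following is only a minimal sketch, assuming a POSIX shell with awk available; `nvidia_module_loaded` is a hypothetical helper name, not part of the installation script. It reads lsmod-formatted text on stdin, so it works on live output or on a saved copy:

```shell
# Hypothetical helper: succeeds if lsmod-style input lists the core "nvidia" module.
# Live use:   lsmod | nvidia_module_loaded
# Saved copy: nvidia_module_loaded < lsmod.txt
nvidia_module_loaded() {
    # The first column of lsmod output is the module name. Match it exactly,
    # so that a helper module such as nvidia_uvm alone does not count.
    awk '$1 == "nvidia" { found = 1 } END { exit found ? 0 : 1 }'
}
```

A zero exit status means the core driver module is loaded; a non-zero status means nvidia-smi has nothing to talk to.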

JulesBelveze commented 2 years ago

I've deleted the old instances (where this issue occurred), and I can't reproduce it by stopping and starting the instance I currently have either.

I'll ping you and share the command output as soon as the error happens again.

JulesBelveze commented 2 years ago

@m-strzelczyk closing it for now as I'm not able to reproduce it. Will re-open if this occurs again, sorry for that

JulesBelveze commented 2 years ago

Hey @m-strzelczyk, it finally occurred again! Here's the output of the commands you asked for, let me know if you need anything else 😃

>>> dmesg | grep -i nvidia
[    4.122300] audit: type=1400 audit(1652363138.199:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=496 comm="apparmor_parser"
[    4.122306] audit: type=1400 audit(1652363138.199:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=496 comm="apparmor_parser"

>>> lsmod
Module                  Size  Used by
nls_iso8859_1          16384  1
dm_multipath           40960  0
scsi_dh_rdac           16384  0
scsi_dh_emc            16384  0
scsi_dh_alua           20480  0
crct10dif_pclmul       16384  1
crc32_pclmul           16384  0
ghash_clmulni_intel    16384  0
aesni_intel           376832  0
virtio_net             57344  0
net_failover           20480  1 virtio_net
failover               16384  1 net_failover
crypto_simd            16384  1 aesni_intel
cryptd                 24576  2 crypto_simd,ghash_clmulni_intel
input_leds             16384  0
psmouse               155648  0
serio_raw              20480  0
efi_pstore             16384  0
sch_fq_codel           20480  13
drm                   557056  0
virtio_rng             16384  0
ip_tables              32768  0
x_tables               49152  1 ip_tables
autofs4                45056  2

Dunno if this could be related but the instance got preempted last time I used it.

m-strzelczyk commented 2 years ago

Hi Jules!

Thanks for the new info :) If you still have the disk around, could you send me the log files from /opt/google/gpu-installer/ dir? The installation script should be logging everything to files in this directory, so it should tell us what happened with the installation. Thanks!

JulesBelveze commented 2 years ago

Here you go @m-strzelczyk 😃

err.log
```
[2022-05-05 12:34:36] Executing: which nvidia-smi
[2022-05-05 12:34:36] Executing: uname -r
[2022-05-05 12:34:36] Executing: apt update
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
[2022-05-05 12:34:45] Executing: apt install -y linux-headers-5.13.0-1024-gcp software-properties-common pciutils gcc make
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
[2022-05-05 12:35:02] Executing: lspci -n
[2022-05-05 12:35:02] Executing: lspci -n
[2022-05-05 12:35:02] Executing: curl -fSsl -O https://us.download.nvidia.com/XFree86/Linux-x86_64/495.46/NVIDIA-Linux-x86_64-495.46.run
[2022-05-05 12:35:03] Executing: sh NVIDIA-Linux-x86_64-495.46.run -s
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
```

out.log ``` [2022-05-05 12:34:36] Executing: which nvidia-smi [2022-05-05 12:34:36] Executing: uname -r 5.13.0-1024-gcp [2022-05-05 12:34:36] Executing: apt update Hit:1 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal InRelease Get:2 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB] Get:3 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB] Get:4 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB] Get:5 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/universe amd64 Packages [8628 kB] Get:6 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/universe Translation-en [5124 kB] Get:7 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/universe amd64 c-n-f Metadata [265 kB] Get:8 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/multiverse amd64 Packages [144 kB] Get:9 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/multiverse Translation-en [104 kB] Get:10 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/multiverse amd64 c-n-f Metadata [9136 B] Get:11 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [1750 kB] Get:12 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main Translation-en [326 kB] Get:13 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 c-n-f Metadata [15.0 kB] Get:14 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [947 kB] Get:15 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/restricted Translation-en [135 kB] Get:16 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [921 kB] Get:17 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/universe Translation-en [206 kB] Get:18 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/universe amd64 c-n-f Metadata [20.7 kB] Get:19 http://europe-west4.gce.archive.ubuntu.com/ubuntu 
focal-updates/multiverse amd64 Packages [24.4 kB] Get:20 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/multiverse Translation-en [7336 B] Get:21 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 c-n-f Metadata [592 B] Get:22 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/main amd64 Packages [42.2 kB] Get:23 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/main Translation-en [10.1 kB] Get:24 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/main amd64 c-n-f Metadata [864 B] Get:25 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/restricted amd64 c-n-f Metadata [116 B] Get:26 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/universe amd64 Packages [22.7 kB] Get:27 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/universe Translation-en [15.5 kB] Get:28 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/universe amd64 c-n-f Metadata [804 B] Get:29 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-backports/multiverse amd64 c-n-f Metadata [116 B] Get:30 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [1422 kB] Get:31 http://security.ubuntu.com/ubuntu focal-security/main Translation-en [246 kB] Get:32 http://security.ubuntu.com/ubuntu focal-security/main amd64 c-n-f Metadata [10.1 kB] Get:33 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [886 kB] Get:34 http://security.ubuntu.com/ubuntu focal-security/restricted Translation-en [126 kB] Get:35 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [700 kB] Get:36 http://security.ubuntu.com/ubuntu focal-security/universe Translation-en [124 kB] Get:37 http://security.ubuntu.com/ubuntu focal-security/universe amd64 c-n-f Metadata [14.4 kB] Get:38 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [20.7 kB] Get:39 http://security.ubuntu.com/ubuntu 
focal-security/multiverse Translation-en [5196 B] Get:40 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 c-n-f Metadata [500 B] Fetched 22.6 MB in 3s (6896 kB/s) Reading package lists... Building dependency tree... Reading state information... 13 packages can be upgraded. Run 'apt list --upgradable' to see them. [2022-05-05 12:34:45] Executing: apt install -y linux-headers-5.13.0-1024-gcp software-properties-common pciutils gcc make Reading package lists... Building dependency tree... Reading state information... linux-headers-5.13.0-1024-gcp is already the newest version (5.13.0-1024.29~20.04.1). pciutils is already the newest version (1:3.6.4-1ubuntu0.20.04.1). pciutils set to manually installed. software-properties-common is already the newest version (0.99.9.8). software-properties-common set to manually installed. The following packages were automatically installed and are no longer required: libatasmart4 libblockdev-fs2 libblockdev-loop2 libblockdev-part-err2 libblockdev-part2 libblockdev-swap2 libblockdev-utils2 libblockdev2 libmm-glib0 libnspr4 libnss3 libnuma1 libparted-fs-resize0 libudisks2-0 usb-modeswitch usb-modeswitch-data Use 'sudo apt autoremove' to remove them. 
The following additional packages will be installed: binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-9 gcc-9 gcc-9-base libasan5 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libgcc-9-dev libgomp1 libisl22 libitm1 liblsan0 libmpc3 libquadmath0 libtsan0 libubsan1 linux-libc-dev manpages-dev Suggested packages: binutils-doc cpp-doc gcc-9-locales gcc-multilib autoconf automake libtool flex bison gdb gcc-doc gcc-9-multilib gcc-9-doc glibc-doc make-doc The following NEW packages will be installed: binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-9 gcc gcc-9 gcc-9-base libasan5 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libgcc-9-dev libgomp1 libisl22 libitm1 liblsan0 libmpc3 libquadmath0 libtsan0 libubsan1 linux-libc-dev make manpages-dev 0 upgraded, 29 newly installed, 0 to remove and 13 not upgraded. Need to get 34.2 MB of archives. After this operation, 151 MB of additional disk space will be used. 
Get:1 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 binutils-common amd64 2.34-6ubuntu1.3 [207 kB] Get:2 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libbinutils amd64 2.34-6ubuntu1.3 [474 kB] Get:3 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libctf-nobfd0 amd64 2.34-6ubuntu1.3 [47.4 kB] Get:4 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libctf0 amd64 2.34-6ubuntu1.3 [46.6 kB] Get:5 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 binutils-x86-64-linux-gnu amd64 2.34-6ubuntu1.3 [1613 kB] Get:6 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 binutils amd64 2.34-6ubuntu1.3 [3380 B] Get:7 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 gcc-9-base amd64 9.4.0-1ubuntu1~20.04.1 [19.4 kB] Get:8 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/main amd64 libisl22 amd64 0.22.1-1 [592 kB] Get:9 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/main amd64 libmpc3 amd64 1.1.0-1 [40.8 kB] Get:10 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 cpp-9 amd64 9.4.0-1ubuntu1~20.04.1 [7500 kB] Get:11 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/main amd64 cpp amd64 4:9.3.0-1ubuntu2 [27.6 kB] Get:12 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libcc1-0 amd64 10.3.0-1ubuntu1~20.04 [48.8 kB] Get:13 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libgomp1 amd64 10.3.0-1ubuntu1~20.04 [102 kB] Get:14 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libitm1 amd64 10.3.0-1ubuntu1~20.04 [26.2 kB] Get:15 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libatomic1 amd64 10.3.0-1ubuntu1~20.04 [9284 B] Get:16 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libasan5 amd64 9.4.0-1ubuntu1~20.04.1 [2751 kB] Get:17 
http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 liblsan0 amd64 10.3.0-1ubuntu1~20.04 [835 kB] Get:18 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libtsan0 amd64 10.3.0-1ubuntu1~20.04 [2009 kB] Get:19 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libubsan1 amd64 10.3.0-1ubuntu1~20.04 [784 kB] Get:20 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libquadmath0 amd64 10.3.0-1ubuntu1~20.04 [146 kB] Get:21 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libgcc-9-dev amd64 9.4.0-1ubuntu1~20.04.1 [2359 kB] Get:22 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 gcc-9 amd64 9.4.0-1ubuntu1~20.04.1 [8274 kB] Get:23 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/main amd64 gcc amd64 4:9.3.0-1ubuntu2 [5208 B] Get:24 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libc-dev-bin amd64 2.31-0ubuntu9.7 [71.6 kB] Get:25 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-libc-dev amd64 5.4.0-109.123 [1111 kB] Get:26 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/main amd64 libcrypt-dev amd64 1:4.4.10-10ubuntu4 [104 kB] Get:27 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal-updates/main amd64 libc6-dev amd64 2.31-0ubuntu9.7 [2518 kB] Get:28 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/main amd64 make amd64 4.2.1-1.2 [162 kB] Get:29 http://europe-west4.gce.archive.ubuntu.com/ubuntu focal/main amd64 manpages-dev all 5.05-1 [2266 kB] Fetched 34.2 MB in 0s (87.3 MB/s) Selecting previously unselected package binutils-common:amd64. (Reading database ... 60927 files and directories currently installed.) Preparing to unpack .../00-binutils-common_2.34-6ubuntu1.3_amd64.deb ... Unpacking binutils-common:amd64 (2.34-6ubuntu1.3) ... Selecting previously unselected package libbinutils:amd64. 
Preparing to unpack .../01-libbinutils_2.34-6ubuntu1.3_amd64.deb ... Unpacking libbinutils:amd64 (2.34-6ubuntu1.3) ... Selecting previously unselected package libctf-nobfd0:amd64. Preparing to unpack .../02-libctf-nobfd0_2.34-6ubuntu1.3_amd64.deb ... Unpacking libctf-nobfd0:amd64 (2.34-6ubuntu1.3) ... Selecting previously unselected package libctf0:amd64. Preparing to unpack .../03-libctf0_2.34-6ubuntu1.3_amd64.deb ... Unpacking libctf0:amd64 (2.34-6ubuntu1.3) ... Selecting previously unselected package binutils-x86-64-linux-gnu. Preparing to unpack .../04-binutils-x86-64-linux-gnu_2.34-6ubuntu1.3_amd64.deb ... Unpacking binutils-x86-64-linux-gnu (2.34-6ubuntu1.3) ... Selecting previously unselected package binutils. Preparing to unpack .../05-binutils_2.34-6ubuntu1.3_amd64.deb ... Unpacking binutils (2.34-6ubuntu1.3) ... Selecting previously unselected package gcc-9-base:amd64. Preparing to unpack .../06-gcc-9-base_9.4.0-1ubuntu1~20.04.1_amd64.deb ... Unpacking gcc-9-base:amd64 (9.4.0-1ubuntu1~20.04.1) ... Selecting previously unselected package libisl22:amd64. Preparing to unpack .../07-libisl22_0.22.1-1_amd64.deb ... Unpacking libisl22:amd64 (0.22.1-1) ... Selecting previously unselected package libmpc3:amd64. Preparing to unpack .../08-libmpc3_1.1.0-1_amd64.deb ... Unpacking libmpc3:amd64 (1.1.0-1) ... Selecting previously unselected package cpp-9. Preparing to unpack .../09-cpp-9_9.4.0-1ubuntu1~20.04.1_amd64.deb ... Unpacking cpp-9 (9.4.0-1ubuntu1~20.04.1) ... Selecting previously unselected package cpp. Preparing to unpack .../10-cpp_4%3a9.3.0-1ubuntu2_amd64.deb ... Unpacking cpp (4:9.3.0-1ubuntu2) ... Selecting previously unselected package libcc1-0:amd64. Preparing to unpack .../11-libcc1-0_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking libcc1-0:amd64 (10.3.0-1ubuntu1~20.04) ... Selecting previously unselected package libgomp1:amd64. Preparing to unpack .../12-libgomp1_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking libgomp1:amd64 (10.3.0-1ubuntu1~20.04) ... 
Selecting previously unselected package libitm1:amd64. Preparing to unpack .../13-libitm1_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking libitm1:amd64 (10.3.0-1ubuntu1~20.04) ... Selecting previously unselected package libatomic1:amd64. Preparing to unpack .../14-libatomic1_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking libatomic1:amd64 (10.3.0-1ubuntu1~20.04) ... Selecting previously unselected package libasan5:amd64. Preparing to unpack .../15-libasan5_9.4.0-1ubuntu1~20.04.1_amd64.deb ... Unpacking libasan5:amd64 (9.4.0-1ubuntu1~20.04.1) ... Selecting previously unselected package liblsan0:amd64. Preparing to unpack .../16-liblsan0_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking liblsan0:amd64 (10.3.0-1ubuntu1~20.04) ... Selecting previously unselected package libtsan0:amd64. Preparing to unpack .../17-libtsan0_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking libtsan0:amd64 (10.3.0-1ubuntu1~20.04) ... Selecting previously unselected package libubsan1:amd64. Preparing to unpack .../18-libubsan1_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking libubsan1:amd64 (10.3.0-1ubuntu1~20.04) ... Selecting previously unselected package libquadmath0:amd64. Preparing to unpack .../19-libquadmath0_10.3.0-1ubuntu1~20.04_amd64.deb ... Unpacking libquadmath0:amd64 (10.3.0-1ubuntu1~20.04) ... Selecting previously unselected package libgcc-9-dev:amd64. Preparing to unpack .../20-libgcc-9-dev_9.4.0-1ubuntu1~20.04.1_amd64.deb ... Unpacking libgcc-9-dev:amd64 (9.4.0-1ubuntu1~20.04.1) ... Selecting previously unselected package gcc-9. Preparing to unpack .../21-gcc-9_9.4.0-1ubuntu1~20.04.1_amd64.deb ... Unpacking gcc-9 (9.4.0-1ubuntu1~20.04.1) ... Selecting previously unselected package gcc. Preparing to unpack .../22-gcc_4%3a9.3.0-1ubuntu2_amd64.deb ... Unpacking gcc (4:9.3.0-1ubuntu2) ... Selecting previously unselected package libc-dev-bin. Preparing to unpack .../23-libc-dev-bin_2.31-0ubuntu9.7_amd64.deb ... Unpacking libc-dev-bin (2.31-0ubuntu9.7) ... 
Selecting previously unselected package linux-libc-dev:amd64. Preparing to unpack .../24-linux-libc-dev_5.4.0-109.123_amd64.deb ... Unpacking linux-libc-dev:amd64 (5.4.0-109.123) ... Selecting previously unselected package libcrypt-dev:amd64. Preparing to unpack .../25-libcrypt-dev_1%3a4.4.10-10ubuntu4_amd64.deb ... Unpacking libcrypt-dev:amd64 (1:4.4.10-10ubuntu4) ... Selecting previously unselected package libc6-dev:amd64. Preparing to unpack .../26-libc6-dev_2.31-0ubuntu9.7_amd64.deb ... Unpacking libc6-dev:amd64 (2.31-0ubuntu9.7) ... Selecting previously unselected package make. Preparing to unpack .../27-make_4.2.1-1.2_amd64.deb ... Unpacking make (4.2.1-1.2) ... Selecting previously unselected package manpages-dev. Preparing to unpack .../28-manpages-dev_5.05-1_all.deb ... Unpacking manpages-dev (5.05-1) ... Setting up manpages-dev (5.05-1) ... Setting up binutils-common:amd64 (2.34-6ubuntu1.3) ... Setting up linux-libc-dev:amd64 (5.4.0-109.123) ... Setting up libctf-nobfd0:amd64 (2.34-6ubuntu1.3) ... Setting up libgomp1:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up make (4.2.1-1.2) ... Setting up libquadmath0:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up libmpc3:amd64 (1.1.0-1) ... Setting up libatomic1:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up libubsan1:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up libcrypt-dev:amd64 (1:4.4.10-10ubuntu4) ... Setting up libisl22:amd64 (0.22.1-1) ... Setting up libbinutils:amd64 (2.34-6ubuntu1.3) ... Setting up libc-dev-bin (2.31-0ubuntu9.7) ... Setting up libcc1-0:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up liblsan0:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up libitm1:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up gcc-9-base:amd64 (9.4.0-1ubuntu1~20.04.1) ... Setting up libtsan0:amd64 (10.3.0-1ubuntu1~20.04) ... Setting up libctf0:amd64 (2.34-6ubuntu1.3) ... Setting up libasan5:amd64 (9.4.0-1ubuntu1~20.04.1) ... Setting up cpp-9 (9.4.0-1ubuntu1~20.04.1) ... Setting up libc6-dev:amd64 (2.31-0ubuntu9.7) ... 
Setting up binutils-x86-64-linux-gnu (2.34-6ubuntu1.3) ... Setting up binutils (2.34-6ubuntu1.3) ... Setting up libgcc-9-dev:amd64 (9.4.0-1ubuntu1~20.04.1) ... Setting up cpp (4:9.3.0-1ubuntu2) ... Setting up gcc-9 (9.4.0-1ubuntu1~20.04.1) ... Setting up gcc (4:9.3.0-1ubuntu2) ... Processing triggers for man-db (2.9.1-1) ... Processing triggers for libc-bin (2.31-0ubuntu9.7) ... [2022-05-05 12:35:02] Executing: lspci -n 00:00.0 0600: 8086:1237 (rev 02) 00:01.0 0601: 8086:7110 (rev 03) 00:01.3 0680: 8086:7113 (rev 03) 00:03.0 0000: 1af4:1004 00:04.0 0302: 10de:20b0 (rev a1) 00:05.0 0200: 1af4:1000 00:06.0 00ff: 1af4:1005 [2022-05-05 12:35:02] Executing: lspci -n 00:00.0 0600: 8086:1237 (rev 02) 00:01.0 0601: 8086:7110 (rev 03) 00:01.3 0680: 8086:7113 (rev 03) 00:03.0 0000: 1af4:1004 00:04.0 0302: 10de:20b0 (rev a1) 00:05.0 0200: 1af4:1000 00:06.0 00ff: 1af4:1005 [2022-05-05 12:35:02] Executing: curl -fSsl -O https://us.download.nvidia.com/XFree86/Linux-x86_64/495.46/NVIDIA-Linux-x86_64-495.46.run [2022-05-05 12:35:03] Executing: sh NVIDIA-Linux-x86_64-495.46.run -s Verifying archive integrity... OK Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 495.46...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... ```
m-strzelczyk commented 2 years ago

OK, it looks like the installation process was completed successfully. For some reason though, the kernel driver modules aren't loaded.

Let's see if the nvidia kernel modules are still present in the filesystem. Please check the output of find /lib/modules -name nvidia*. Some files should be listed; for my Ubuntu 20.04 A100 installation those were:

$ find /lib/modules -name nvidia*
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-drm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-peermem.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-modeset.ko

If the files are there, could you try running sudo nvidia-modprobe and then nvidia-smi, and if that fails, sudo nvidia-smi? According to its own manual page, nvidia-modprobe is used to: (...) create, in a Linux distribution-independent way, NVIDIA Linux device files and load the NVIDIA kernel module (...).

In general, we need to drill down to the point where the system fails to load the nvidia modules - because we can't see them on lsmod output and they are required by nvidia-smi and anything that wants to interact with the GPU.
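
When reading the find output, it helps to group the hits by kernel version directory. The sketch below assumes a POSIX shell; `kernels_with_module` is a hypothetical helper name. It also quotes the search pattern: an unquoted `nvidia*` is expanded by the shell first if the current directory happens to contain a matching file, which can silently change what find looks for.

```shell
# Hypothetical helper: print each kernel version directory under MODULES_ROOT
# (normally /lib/modules) that contains a file named MODULE_FILE.
kernels_with_module() {
    modules_root=${1%/}   # drop one trailing slash, if any
    module_file=$2
    # The pattern is quoted so the shell passes it to find unexpanded.
    find "$modules_root" -type f -name "$module_file" 2>/dev/null |
        while IFS= read -r path; do
            rel=${path#"$modules_root"/}   # strip the root prefix
            printf '%s\n' "${rel%%/*}"     # keep only the first path component
        done | sort -u
}
```

For example, `kernels_with_module /lib/modules nvidia.ko` lists the kernel versions the driver module was installed for, which can then be compared with `uname -r`.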

JulesBelveze commented 2 years ago

It does seem like the files are there:

$ find /lib/modules -name nvidia*
/lib/modules/5.13.0-1025-gcp/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.13.0-1025-gcp/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-drm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-peermem.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-modeset.ko

However, here's the output of the commands you asked for... I don't really think this is gonna help 😞

$ sudo nvidia-modprobe
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

$ sudo nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

m-strzelczyk commented 2 years ago

OK... Weird that you have more files than me, when we both use Ubuntu 20, but this doesn't explain why the kernel doesn't load the modules.

Let's play with the modules manually a bit and see what happens. Try loading the nvidia module manually first with: sudo modprobe nvidia, then check if it's loaded with lsmod. If it's loaded, it should be on the list.

If it's not loaded, try a more manual way to load a module with insmod. To load the nvidia module with insmod you'll first need to load the drm module. You can find it with find /lib/modules -name drm.ko. Once you find it, you can:

sudo insmod $PATH_TO_DRM_KO
sudo insmod $PATH_TO_NVIDIA_KO # probably /lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko

Then check again if the module is loaded with lsmod. Then check dmesg for any clues about what's going on.

After all that, grab the full dmesg output and send it over here, so I can have a look.

JulesBelveze commented 2 years ago

Seems like I can't load the NVIDIA module manually:

$ sudo modprobe nvidia
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.13.0-1025-gcp

Then when I try to locate the drm module I actually get two paths. I tried both when loading the NVIDIA module with insmod, but I get an error:

$ find /lib/modules -name drm.ko
/lib/modules/5.13.0-1025-gcp/kernel/drivers/gpu/drm/drm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/gpu/drm/drm.ko

$ export PATH_TO_DRM_KO=/lib/modules/5.13.0-1024-gcp/kernel/drivers/gpu/drm/drm.ko
$ export PATH_TO_NVIDIA_KO=/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
$ sudo insmod $PATH_TO_DRM_KO
insmod: ERROR: could not insert module /lib/modules/5.13.0-1024-gcp/kernel/drivers/gpu/drm/drm.ko: File exists
$ sudo insmod $PATH_TO_NVIDIA_KO
insmod: ERROR: could not insert module /lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko: Invalid parameters

Am I doing something wrong??

Note: the issue seems to appear as soon as a VM with a GPU gets preempted. It just occurred with a similar instance of mine.

m-strzelczyk commented 2 years ago

No, you are not doing anything wrong. The problem seems to be with either the installation method or some weird order in which things happened when you installed the drivers. Either way, I'll need to figure this out and make sure it's fixed.

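I think I see what the issue is. You have 2 subfolders in your /lib/modules/ directory: 5.13.0-1025-gcp and 5.13.0-1024-gcp. Judging by your find output, the 1024 one contains nvidia.ko, while the 1025 one has drm.ko and nvidiafb.ko but no nvidia.ko. That also matches the modprobe error, which looked for the module under 5.13.0-1025-gcp.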

Here's what I think is going on:

  1. You created a new Ubuntu instance, which came with kernel version 5.13.0-1024.
  2. You installed the GPU drivers. In the process or soon after, the kernel got updated to 5.13.0-1025.
  3. Because a kernel update only takes effect after a reboot, the old kernel with the proper modules installed kept working fine until preemption forced a reboot.
  4. The machine restarted on the new 5.13.0-1025 kernel, which for some reason didn't inherit the GPU driver modules.
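
The mismatch described in these steps can be detected before nvidia-smi fails. This is a rough sketch only, assuming a POSIX shell; `module_installed_for_kernel` is a hypothetical helper name and the arguments are illustrative:

```shell
# Hypothetical helper: succeeds if MODULE_FILE exists somewhere under
# MODULES_ROOT/KERNEL_RELEASE, i.e. the module was installed for that kernel.
module_installed_for_kernel() {
    modules_root=${1%/}
    kernel_release=$2
    module_file=$3
    # No directory for this kernel release means no modules for it at all.
    [ -d "$modules_root/$kernel_release" ] || return 1
    # Succeed only if find reports at least one matching file.
    [ -n "$(find "$modules_root/$kernel_release" -type f -name "$module_file" 2>/dev/null)" ]
}
```

On the affected VM, `module_installed_for_kernel /lib/modules "$(uname -r)" nvidia.ko` would be expected to fail after the reboot into the newer kernel, since nvidia.ko only exists under the older kernel's directory.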

My temporary solution for you:

# Download the driver binary, or it might be still present in /opt/google/gpu-installer
curl -fSsl -O https://us.download.nvidia.com/XFree86/Linux-x86_64/495.46/NVIDIA-Linux-x86_64-495.46.run

# Run the installer to reinstall the kernel modules to the new kernel version.
sudo sh NVIDIA-Linux-x86_64-495.46.run -s

This simply reinstalls the driver for the new kernel version. It should survive any preemptions and reboots, until there is a new version of kernel installed. Then you'll have to reinstall it again. Since all the prerequisites for the driver installer were already met, it's OK to just execute the NVIDIA-Linux-x86_64-495.46.run without the full script from this repository.

I will work on updating the script and our documentation to find a way around this, so it's automatically taken care of.

JulesBelveze commented 2 years ago

Interesting! Your workaround did work as expected, thanks for the hint and your precious help 😃

m-strzelczyk commented 2 years ago

That's great to hear! :) I will work on a permanent and more convenient solution as soon as I can.

m-strzelczyk commented 2 years ago

This commit should resolve this problem. DKMS will rebuild the driver modules on kernel update. I'll run some tests on how it works, but this should be it for the issue.
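
Whether DKMS has built the driver for a given kernel can be checked from the output of `dkms status`. The sketch below is an illustration under stated assumptions, not something taken from the commit: `dkms_nvidia_installed_for` is a hypothetical helper name, and the parsing assumes each status line starts with the module name, lists the kernel release as a comma-separated field, and contains ": installed". The exact format differs between DKMS versions, so verify it against real output first.

```shell
# Hypothetical helper: reads `dkms status` output on stdin and succeeds if some
# line reports an nvidia module as installed for KERNEL_RELEASE.
# Assumed line shapes (they vary by DKMS version):
#   nvidia, 495.46, 5.13.0-1025-gcp, x86_64: installed
#   nvidia/495.46, 5.13.0-1025-gcp, x86_64: installed
dkms_nvidia_installed_for() {
    kernel_release=$1
    awk -v kver="$kernel_release" '
        /^nvidia/ && /: installed/ {
            # Split on commas and colons, then look for an exact kernel match.
            n = split($0, fields, /[,:] */)
            for (i = 1; i <= n; i++) if (fields[i] == kver) found = 1
        }
        END { exit found ? 0 : 1 }'
}
```

A possible use after a kernel update is `dkms status | dkms_nvidia_installed_for "$(uname -r)"`, with a non-zero exit status meaning the modules were not rebuilt for the running kernel.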