Closed JulesBelveze closed 2 years ago
Hi!
Thanks for reporting this problem. It seems that it's related more to the driver itself than the installation script that we have here, but I'll try to help anyway :)
Just to make sure, the Debian and Ubuntu images you use are the default public ones available in GCP, right?
Since this is an A100 instance, there's only one machine type you can pick (a2-highgpu-1g), but can you tell me in which zone you are running your instances? I just want to be able to replicate the issue as closely as possible.
Is there anything else that's not default about your instance? Did you use Secure Boot?
Also, could you provide your dmesg output, or at least the output of dmesg | grep -i nvidia?
Hi @m-strzelczyk thanks for your help!
> Just to make sure, the Debian and Ubuntu images you use are the default public ones available in GCP, right?
Yes, I am using the default images.
> Since this is A100 instance, there's only one machine type you can pick a2-highgpu-1g - but can you tell me in which zone are you running your instances?
The machine type is indeed a2-highgpu-1g and I'm running my instances in europe-west4-a.
> Is there anything else that's not default about your instance? Did you use Secure Boot?
The instances I'm using are preemptible (dunno if this can help).
@JulesBelveze I was not able to replicate your issue by creating a preemptible A100-equipped machine in us-central1. I'll try again in europe-west4-a like you did.
In the meantime, could you please share your dmesg output and lsmod output? This should tell us if the NVIDIA drivers are loading at all.
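The lsmod check can also be scripted. Below is a minimal sketch, assuming the usual lsmod layout (a header row, then one line per module starting with its name). The function name nvidia_modules_loaded is made up for illustration and is not part of the installation script.

```python
def nvidia_modules_loaded(lsmod_output):
    """Return the names of loaded kernel modules that start with 'nvidia'."""
    names = []
    # The first line is the "Module Size Used by" header, so skip it.
    for line in lsmod_output.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


# A few hand-copied lsmod lines, used here only as sample input.
# On a real machine the text could come from:
#   subprocess.run(["lsmod"], capture_output=True, text=True).stdout
sample = """Module Size Used by
drm 557056 0
virtio_rng 16384 0
ip_tables 32768 0
"""

loaded = nvidia_modules_loaded(sample)
print("NVIDIA modules loaded:", ", ".join(loaded) if loaded else "none")
```

An empty result means nvidia-smi has nothing to talk to, whether or not the driver files are on disk.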
I've deleted the old instances (on which this issue occurred), and turning my current instance off and on doesn't reproduce it either.
I'll ping you and share the command output with you as soon as the error happens again.
@m-strzelczyk closing it for now as I'm not able to reproduce it. Will re-open if this occurs again, sorry for that
Hey @m-strzelczyk, it finally occurred again! Here's the output of the commands you asked for, let me know if you need anything else 😃
>>> dmesg | grep -i nvidia
[ 4.122300] audit: type=1400 audit(1652363138.199:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=496 comm="apparmor_parser"
[ 4.122306] audit: type=1400 audit(1652363138.199:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=496 comm="apparmor_parser"
>>> lsmod
Module Size Used by
nls_iso8859_1 16384 1
dm_multipath 40960 0
scsi_dh_rdac 16384 0
scsi_dh_emc 16384 0
scsi_dh_alua 20480 0
crct10dif_pclmul 16384 1
crc32_pclmul 16384 0
ghash_clmulni_intel 16384 0
aesni_intel 376832 0
virtio_net 57344 0
net_failover 20480 1 virtio_net
failover 16384 1 net_failover
crypto_simd 16384 1 aesni_intel
cryptd 24576 2 crypto_simd,ghash_clmulni_intel
input_leds 16384 0
psmouse 155648 0
serio_raw 20480 0
efi_pstore 16384 0
sch_fq_codel 20480 13
drm 557056 0
virtio_rng 16384 0
ip_tables 32768 0
x_tables 49152 1 ip_tables
autofs4 45056 2
Dunno if this could be related but the instance got preempted last time I used it.
Hi Jules!
Thanks for the new info :) If you still have the disk around, could you send me the log files from the /opt/google/gpu-installer/ directory? The installation script should be logging everything to files in this directory, so it should tell us what happened with the installation. Thanks!
Here you go @m-strzelczyk 😃
OK, it looks like the installation process was completed successfully. For some reason though, the kernel driver modules aren't loaded.
Let's see if the nvidia kernel modules are still present in the filesystem. Please check the output of find /lib/modules -name nvidia*. There should be some files listed; for my Ubuntu 20.04 A100 installation those were:
$ find /lib/modules -name nvidia*
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-drm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-peermem.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-modeset.ko
If the files are there, could you try running sudo nvidia-modprobe and then nvidia-smi, and if that fails, sudo nvidia-smi? The purpose of nvidia-modprobe is, according to its own manual page, to "(...) create, in a Linux distribution-independent way, NVIDIA Linux device files and load the NVIDIA kernel module (...)".
In general, we need to drill down to the point where the system fails to load the nvidia modules, because we can't see them in the lsmod output and they are required by nvidia-smi and anything else that wants to interact with the GPU.
It does seem like the files are there:
$ find /lib/modules -name nvidia*
/lib/modules/5.13.0-1025-gcp/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.13.0-1025-gcp/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-drm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-peermem.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-modeset.ko
However, here's the output of the commands you asked for... I don't really think this is gonna help 😞
$ sudo nvidia-modprobe
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ sudo nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
OK... Weird that you have more files than me, when we both use Ubuntu 20, but this doesn't explain why the kernel doesn't load the modules.
Let's play with the modules manually a bit and see what happens. Try loading the nvidia module manually first with sudo modprobe nvidia, then check if it's loaded with lsmod. If it's loaded, it should be on the list.
If it's not loaded, try a more manual way to load a module: insmod. To load the nvidia module with insmod, you'll first need to load the drm module. You can find it with find /lib/modules -name drm.ko. Once you find it, you can run:
sudo insmod $PATH_TO_DRM_KO
sudo insmod $PATH_TO_NVIDIA_KO # probably /lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
Then check again if the module is loaded with lsmod, and check dmesg for any clues about what's going on.
After all that, grab the full dmesg output and send it over here, so I can have a look.
Seems like I can't load the NVIDIA module manually:
$ sudo modprobe nvidia
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.13.0-1025-gcp
Then, when I try to locate the drm module, I actually get two paths. I've tried loading the NVIDIA module with insmod using both of them, but I'm getting an error:
$ find /lib/modules -name drm.ko
/lib/modules/5.13.0-1025-gcp/kernel/drivers/gpu/drm/drm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/gpu/drm/drm.ko
$ export PATH_TO_DRM_KO=/lib/modules/5.13.0-1024-gcp/kernel/drivers/gpu/drm/drm.ko
$ export PATH_TO_NVIDIA_KO=/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
$ sudo insmod $PATH_TO_DRM_KO
insmod: ERROR: could not insert module /lib/modules/5.13.0-1024-gcp/kernel/drivers/gpu/drm/drm.ko: File exists
$ sudo insmod $PATH_TO_NVIDIA_KO
insmod: ERROR: could not insert module /lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko: Invalid parameters
Am I doing something wrong??
Note: the issue seems to appear as soon as a VM with a GPU gets preempted. It just occurred with a similar instance of mine.
No, you are not doing anything wrong. The problem seems to be with either the installation method or some weird order in which things happened when you installed the drivers. Either way, I'll need to figure this out and make sure it's fixed.
I think I see what the issue is here. You have 2 subfolders in your /lib/modules/ directory: 5.13.0-1025-gcp and 5.13.0-1024-gcp. The 1024 one contains the nvidia.ko module, but the 1025 one does not.
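To make the mismatch easier to see, the find output can be grouped by kernel version. This is a small sketch that only parses text. It assumes paths of the form /lib/modules/<kernel>/... ending in .ko, and the helper name group_modules_by_kernel is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_modules_by_kernel(find_output):
    """Map each kernel version to the .ko file names listed under it."""
    groups = defaultdict(list)
    for line in find_output.splitlines():
        path = PurePosixPath(line.strip())
        parts = path.parts  # ('/', 'lib', 'modules', '<kernel>', ...)
        is_module_path = len(parts) > 4 and parts[1:3] == ("lib", "modules")
        if is_module_path and path.suffix == ".ko":
            groups[parts[3]].append(path.name)
    return dict(groups)


# Lines copied from the find output posted earlier in this thread.
find_output = """\
/lib/modules/5.13.0-1025-gcp/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.13.0-1025-gcp/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-drm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia.ko
/lib/modules/5.13.0-1024-gcp/kernel/drivers/video/nvidia-modeset.ko
"""

for kernel, modules in group_modules_by_kernel(find_output).items():
    status = "has nvidia.ko" if "nvidia.ko" in modules else "no nvidia.ko"
    print(kernel, "->", status)
```

Note that nvidiafb.ko matches the nvidia* pattern but is not the nvidia.ko driver module, which is why the check looks for the exact file name.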
Here's what I think is going on:
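1. The driver was installed while the system was running the 5.13.0-1024 kernel.
2. At some point, the kernel was updated to 5.13.0-1025.
3. After the restart, the system booted the 5.13.0-1025 kernel, which for some reason didn't inherit the GPU driver modules.

My temporary solution for you: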
# Download the driver binary, or it might be still present in /opt/google/gpu-installer
curl -fSsl -O https://us.download.nvidia.com/XFree86/Linux-x86_64/495.46/NVIDIA-Linux-x86_64-495.46.run
# Run the installer to reinstall the kernel modules to the new kernel version.
sudo sh NVIDIA-Linux-x86_64-495.46.run -s
This simply reinstalls the driver for the new kernel version. It should survive any preemptions and reboots until a new kernel version is installed. Then you'll have to reinstall it again. Since all the prerequisites for the driver installer were already met, it's OK to just execute NVIDIA-Linux-x86_64-495.46.run without the full script from this repository.
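Until there is a permanent fix, a check like the following could tell you after a reboot whether the reinstall is needed. It is only a sketch. It assumes the driver module is stored as an uncompressed nvidia.ko somewhere under /lib/modules/<running kernel>/, and the function name driver_installed_for is made up for illustration.

```python
import platform
from pathlib import Path


def driver_installed_for(release, modules_root="/lib/modules"):
    """Return True if a file named exactly nvidia.ko exists for this kernel."""
    kernel_dir = Path(modules_root) / release
    # rglob("nvidia.ko") matches that exact file name at any depth,
    # so nvidiafb.ko or nvidia-drm.ko alone do not count.
    return kernel_dir.is_dir() and any(kernel_dir.rglob("nvidia.ko"))


if __name__ == "__main__":
    running = platform.release()  # same value as `uname -r`
    if driver_installed_for(running):
        print(f"nvidia.ko found for kernel {running}")
    else:
        print(f"no nvidia.ko for kernel {running}; the driver needs a reinstall")
```

The check looks only at the running kernel because modprobe resolves modules from that kernel's directory, as the "not found in directory /lib/modules/5.13.0-1025-gcp" error shows.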
I will work on updating the script and our documentation to find a way around this, so it's automatically taken care of.
Interesting! Your workaround did work as expected, thanks for the hint and your precious help 😃
That's great to hear! :) I will work on a permanent and more convenient solution as soon as I can.
This commit should resolve this problem. DKMS will rebuild the driver modules on kernel update. I'll run some tests on how it works, but this should be it for the issue.
Describe the bug
I am working with an A100 and have installed the NVIDIA driver using your script install_gpu_driver.py, and everything works smoothly. However, I am experiencing strange behaviour when I stop the instance and restart it. The error message is the following
and if I try to re-run the install_gpu_driver.py script, it says that the drivers are already installed...
I have googled around and some people suggest waiting a moment before connecting to the instance, but that didn't solve my problem.
It has happened to me quite a lot of times and I've tried different OSes (Ubuntu and Debian), but this seems to be a recurring problem. Any idea why this occurs?
Environment