NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia-driver-validation crashloopbackoff #133

Open · Mutantt opened this issue 3 years ago

Mutantt commented 3 years ago

Hi, I'm using Ubuntu 20.04 (kernel 5.4.0-62) with the 460.32.03 NVIDIA driver image; my GPU is a GTX 1660 Ti. When I install the operator, the nvidia-driver-daemonset pod goes to the Running state and its log shows the installation is complete:

Runtime sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 460.32.03) is now complete.

Loading ipmi and i2c_core kernel modules... Loading NVIDIA driver kernel modules... Starting NVIDIA persistence daemon... Mounting NVIDIA driver rootfs... Done, now waiting for signal

but nvidia-driver-validation is in CrashLoopBackOff with the error below:

Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown

Update: after an hour, this is the message shown when describing the nvidia-driver-validation pod: Message: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: not found: unknown

Could you please advise what I can do to solve this issue?

shivamerla commented 3 years ago

@Mutantt Which version of the GPU operator are you using? Does the validation pod stay in a crash loop forever, or does it recover?

Mutantt commented 3 years ago

Hi @shivamerla, thank you for your reply. I'm using version 1.4.0, and the validation pod stays in a crash loop forever.

shivamerla commented 3 years ago

@Mutantt can you try with 1.5.0 and confirm if you still see this issue?

Mutantt commented 3 years ago

The result is the same. This is the complete output of the driver pod:

========= NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.0-62-generic

Stopping NVIDIA persistence daemon... Unloading NVIDIA driver kernel modules... Unmounting NVIDIA driver rootfs... Checking NVIDIA driver packages... Updating the package cache... Resolving Linux kernel version... Proceeding with Linux kernel version 5.4.0-62-generic Installing Linux kernel headers... Installing Linux kernel module files... Generating Linux kernel version string... Compiling NVIDIA driver kernel modules... /usr/src/nvidia-460.32.03/kernel/nvidia/nv-procfs.o: warning: objtool: .text.unlikely: unexpected end of section /usr/src/nvidia-460.32.03/kernel/nvidia-drm/nvidia-drm-modeset.c: In function '__will_generate_flip_event': /usr/src/nvidia-460.32.03/kernel/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable 'primary_plane' [-Wunused-variable] 96 | struct drm_plane *primary_plane = crtc->primary; | ^~~~~ Relinking NVIDIA driver kernel modules... Building NVIDIA driver package nvidia-modules-5.4.0-62... Installing NVIDIA driver kernel modules...

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

ERROR: Unable to open 'kernel/dkms.conf' for copying (No such file or directory)

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 4 CPUs online; setting concurrency level to 4. Installing NVIDIA driver version 460.32.03. Performing CC sanity check with CC="/usr/bin/cc". Performing CC check. Kernel source path: '/lib/modules/5.4.0-62-generic/build'

Kernel output path: '/lib/modules/5.4.0-62-generic/build'

Performing Compiler check. Performing Dom0 check. Performing Xen check. Performing PREEMPT_RT check. Performing vgpu_kvm check. Cleaning kernel module build directory. Building kernel modules : [##############################] 100% Kernel module compilation complete. Unable to determine if Secure Boot is enabled: No such file or directory Kernel messages: [202656.645190] audit: type=1400 audit(1611428883.996:2144): avc: denied { read } for pid=2053680 comm="iptables" path="pipe:[23529]" dev="pipefs" ino=23529 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=fifo_file permissive=1 [202656.651701] IPv6: ADDRCONF(NETDEV_CHANGE): vethwepl6c0ac87: link becomes ready [202656.651741] weave: port 4(vethwepl6c0ac87) entered blocking state [202656.651743] weave: port 4(vethwepl6c0ac87) entered forwarding state [202825.545994] audit: type=1400 audit(1611429052.899:2145): avc: denied { search } for pid=2056019 comm="systemd-detect-" name="sys" dev="proc" ino=4026531854 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:object_r:sysctl_t:s0 tclass=dir permissive=1 [202825.546009] audit: type=1400 audit(1611429052.899:2146): avc: denied { search } for pid=2056019 comm="systemd-detect-" name="kernel" dev="proc" ino=16455 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:object_r:sysctl_kernel_t:s0 tclass=dir permissive=1 [202825.546018] audit: type=1400 audit(1611429052.899:2147): avc: denied { read } for pid=2056019 comm="systemd-detect-" name="osrelease" dev="proc" ino=16456 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:object_r:sysctl_kernel_t:s0 tclass=file permissive=1 [202825.546026] audit: type=1400 audit(1611429052.899:2148): avc: denied { open } for pid=2056019 comm="systemd-detect-" path="/proc/sys/kernel/osrelease" dev="proc" ino=16456 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:object_r:sysctl_kernel_t:s0 tclass=file permissive=1 [202825.546035] audit: type=1400 audit(1611429052.899:2149): avc: denied { getattr } for pid=2056019 comm="systemd-detect-" path="/proc/sys/kernel/osrelease" dev="proc" ino=16456 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:object_r:sysctl_kernel_t:s0 tclass=file permissive=1 [202825.546044] audit: type=1400 audit(1611429052.899:2150): avc: denied { ioctl } for pid=2056019 comm="systemd-detect-" path="/proc/sys/kernel/osrelease" dev="proc" ino=16456 ioctlcmd=0x5401 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:object_r:sysctl_kernel_t:s0 tclass=file permissive=1 [202825.546052] audit: type=1400 audit(1611429052.899:2151): avc: denied { search } for pid=2056019 comm="systemd-detect-" name="systemd" dev="tmpfs" ino=268 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:object_r:init_var_run_t:s0 tclass=dir permissive=1 [202825.546062] audit: type=1400 audit(1611429052.899:2152): avc: denied { search } for pid=2056019 comm="systemd-detect-" name="1" dev="proc" ino=16457 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:system_r:init_t:s0 tclass=dir permissive=1 [202825.546113] audit: type=1400 audit(1611429052.899:2153): avc: denied { read } for pid=2056019 comm="systemd-detect-" name="environ" dev="proc" ino=16458 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:system_r:init_t:s0 tclass=file permissive=1 [202825.546125] audit: type=1400 audit(1611429052.899:2154): avc: denied { open } for pid=2056019 comm="systemd-detect-" 
path="/proc/1/environ" dev="proc" ino=16458 scontext=system_u:system_r:systemd_detect_virt_t:s0 tcontext=system_u:system_r:init_t:s0 tclass=file permissive=1 [202825.572868] SELinux: security_context_str_to_sid(system_u:object_r:snappy_snap_t:s0) failed for (dev loop6, type squashfs) errno=-22 [203065.099770] kauditd_printk_skb: 5 callbacks suppressed [203065.099771] audit: type=1400 audit(1611429292.455:2160): avc: denied { module_load } for pid=2070885 comm="modprobe" path="/usr/lib/modules/5.4.0-62-generic/kernel/drivers/vfio/mdev/mdev.ko" dev="overlay" ino=3934476 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=1 [203065.150284] nvidia-nvlink: Nvlink Core is being initialized, major device number 239 [203065.151551] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none [203065.194241] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020 [203065.204443] nvidia-uvm: Loaded the UVM driver, major device number 237. [203065.207500] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.32.03 Sun Dec 27 18:51:11 UTC 2020 [203065.212057] nvidia-modeset: Unloading [203065.276320] nvidia-uvm: Unloaded the UVM driver. [203065.293350] nvidia-nvlink: Unregistered the Nvlink Core, major device number 239 Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (460.32.03): Installing: [##############################] 100% Driver file installation is complete. Running post-install sanity check: Checking: [##############################] 100% Post-install sanity check passed. Running runtime sanity check: Checking: [##############################] 100% Runtime sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 460.32.03) is now complete.

Loading ipmi and i2c_core kernel modules... Loading NVIDIA driver kernel modules... Starting NVIDIA persistence daemon... Mounting NVIDIA driver rootfs... Done, now waiting for signal

And this is the result of describing the validation pod:

Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown

shivamerla commented 3 years ago

@Mutantt Are you sure it's 1.5.0? Asking because cuda-vector-add is no longer a standalone pod as in earlier releases; it is now added as an initContainer to the device-plugin pods. Can you run helm ls to make sure you are using the right version? After uninstalling, please run docker info | grep -i runtime, make sure it is set to runc, and then re-install the operator.
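
For example (assuming Helm 3; the exact flags are illustrative):

helm ls -A                              # check the deployed gpu-operator chart/app version
docker info | grep -i runtime           # the default runtime should show runc before re-installing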

Mutantt commented 3 years ago

I have updated all the necessary images in our local repo and used version 1.5.0. I also checked the runtime and it's set to runc. Now the toolkit pod comes up with status Init:0/1 and stays in that state. I can't check the pod's logs, but here is the log of the toolkit pod's driver-validation container:

waiting for nvidia drivers to be loaded sh: nvidia-smi: command not found waiting for nvidia drivers to be loaded Unable to determine the device handle for GPU 0000:03:00.0: Unknown Error

And here are the last lines of the driver pod logs:

Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (460.32.03): Installing: [##############################] 100% Driver file installation is complete. Running post-install sanity check: Checking: [##############################] 100% Post-install sanity check passed. Running runtime sanity check: Checking: [##############################] 100% Runtime sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 460.32.03) is now complete.

Loading ipmi and i2c_core kernel modules... Loading NVIDIA driver kernel modules... Starting NVIDIA persistence daemon... Mounting NVIDIA driver rootfs... Done, now waiting for signal

shivamerla commented 3 years ago

@Mutantt We have not officially qualified the 460.xx drivers. I will get back to you on this.

Mutantt commented 3 years ago

@shivamerla I have tested 450.80.02 too and the result is the same. Here is the output of the driver pod:

[ 867.522489] nvidia: loading out-of-tree module taints kernel. [ 867.522500] nvidia: module license 'NVIDIA' taints kernel. [ 867.522500] Disabling lock debugging due to kernel taint [ 867.532121] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 867.539152] nvidia-nvlink: Nvlink Core is being initialized, major device number 239 [ 867.540455] nvidia 0000:03:00.0: enabling device (0000 -> 0003) [ 867.543085] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none [ 867.587806] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020 [ 867.600074] nvidia-uvm: Loaded the UVM driver, major device number 237. [ 867.604175] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020 [ 867.607213] nvidia-modeset: Unloading [ 867.672662] nvidia-uvm: Unloaded the UVM driver. [ 867.708051] nvidia-nvlink: Unregistered the Nvlink Core, major device number 239 Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (450.80.02): Installing: [##############################] 100% Driver file installation is complete. Running post-install sanity check: Checking: [##############################] 100% Post-install sanity check passed. Running runtime sanity check: Checking: [##############################] 100% Runtime sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 450.80.02) is now complete.

Loading ipmi and i2c_core kernel modules... Loading NVIDIA driver kernel modules... Starting NVIDIA persistence daemon... Mounting NVIDIA driver rootfs... Done, now waiting for signal

Mutantt commented 3 years ago

Hi @shivamerla, I'm using a VM as the GPU worker in my k8s cluster and used passthrough to attach the GPU to that worker. Is it necessary to install the NVIDIA vGPU software on my ESXi host, or should my approach work without it?

shivamerla commented 3 years ago

@Mutantt If the GPU is attached as passthrough, then vGPU drivers are not necessary. Is it possible to attach the VM config info and journal logs from the system?

Mutantt commented 3 years ago

@shivamerla here is the information you requested:

journal-log.txt INF-GPULog.txt

hbahadorzadeh commented 3 years ago

Hi, I have the same problem. Any updates on this issue?

hbahadorzadeh commented 3 years ago

Dear @shivamerla, do you have any updates?

shivamerla commented 3 years ago

@hbahadorzadeh can you try out v1.7.0 and check whether the validation pod is still crashing?

hbahadorzadeh commented 3 years ago

@shivamerla thanks for following up on the issue.

We solved our problem, and the issue is not related to the GPU operator or the NVIDIA drivers. It was a VMware problem with the passed-through GPU. The problem was solved by adding "hypervisor.cpuid.v0=false" to the VM options. I should mention that if you add that option, your Linux guest won't boot unless you reduce the number of vCPUs to one. I tried to boot my Ubuntu VM with multiple CPUs, and oddly it only boots if you boot it normally from the recovery menu!
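
For reference, the setting corresponds to this line in the VM's .vmx file (also reachable via Edit Settings > VM Options > Advanced > Configuration Parameters; the exact UI path is from memory, so double-check in your vSphere version):

hypervisor.cpuid.v0 = "FALSE"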

I know the option causes the guest OS not to recognize itself as a VM, but I don't know why Ubuntu crashes while booting up!

rhysjtevans commented 3 years ago

I've run into this issue as well; my driver-validation init container is in a crash loop, failing with:

failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /run/nvidia/driver/sbin/ldconfig.real: no such file or directory: unknown

I can see an ldconfig in /run/nvidia/driver/sbin/, but not ldconfig.real. I'm running version 1.7.1.

@shivamerla, in relation to your note above ("We have officially not qualified with..."), is there a list of qualified drivers or a latest version you recommend? I tried 450.80.02-centos8 and continuously got compilation errors; see #210.
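
For anyone hitting the same thing, this is roughly how I checked (the config path below is the stock nvidia-container-runtime location; it may differ when the operator's toolkit container manages the config):

ls -l /run/nvidia/driver/sbin/ldconfig*                    # only ldconfig is present, no ldconfig.real
grep ldconfig /etc/nvidia-container-runtime/config.toml    # the path the nvidia hook is configured to call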

rhysjtevans commented 3 years ago

I've rebuilt the driver Docker image using 450.119.04 and the results are the same.

shivamerla commented 3 years ago

@rhysjtevans I think we had verified 460.32.03 with CentOS 8. I will look into the details of the compilation errors soon.

rhysjtevans commented 3 years ago

Thanks @shivamerla I will test with 460.32.03

rhysjtevans commented 3 years ago

I believe I've found the issue (for my problem, at least), but I'm not sure of the root cause as to why it's looking for a .real file in the first place. For now I've created a PR for just the centos8 image, as it appears the fedora image uses the same "fix".

For anyone who stumbles on this, I built the container image as follows:

git clone git@gitlab.com:nvidia/container-images/driver.git && cd driver/centos8 && \
docker build --build-arg DRIVER_VERSION=460.73.01 \
             --build-arg BASE_URL=https://uk.download.nvidia.com/tesla \
             -t nvidia/driver:460.73.01_0.1.0-centos8 \
             .

Here's my PR. https://gitlab.com/nvidia/container-images/driver/-/merge_requests/131

shivamerla commented 3 years ago

@rhysjtevans The comment here explains why the .real file is preferred. But if it is not present, libnvidia-container should fall back to using the /sbin/ldconfig file. @elezar to confirm why this is not happening. Is this something that was recently fixed?

elezar commented 3 years ago

@rhysjtevans @shivamerla there was a change to how the config is processed (see c34dcd6b572b3cc8c08a38e4398fc22574dc1d42). This should not have affected the behaviour in the case where the config had already been modified on the host.

The issue here is that we read the source config from the container-toolkit container, which is released as either an ubuntu18.04 image (the default) or a ubi8 image (selected via the toolkit.version value when installing). In this case, where we have a CentOS 8 host, the setting using .real does not match the state of the host.

Note that this would not explain the issues that @Mutantt is seeing, but it could be a different issue manifesting the same symptoms.

In terms of options, @rhysjtevans you could use the ubi8 image for the container-toolkit by adding --set toolkit.version="1.5.0-ubi8" when installing the operator using Helm.
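
For example (the chart reference, release name, and namespace below are placeholders; use whatever matches your setup):

helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set toolkit.version="1.5.0-ubi8"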

On our end, we could improve how we resolve the ldconfig binary that is actually used and release that as part of our upcoming container-toolkit container release. Looking at the code in libnvidia-container that @shivamerla pointed out, the fallback is only applied if the ldconfig path is not explicitly set in the config, which is not the case here, so the non-existent path is kept.

@rhysjtevans you should be able to modify the config on the host to remove the .real suffix for the time being and check whether this addresses the issue. Note that if the toolkit container is restarted, it will most likely recreate the "incorrect" config.
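
As a rough sketch of that temporary workaround (the config path below is the common default; when the toolkit container manages the config, the file may live under /usr/local/nvidia/toolkit instead, so adjust as needed):

# strip the .real suffix from the configured ldconfig path, keeping a backup
sudo sed -i.bak 's|ldconfig\.real|ldconfig|' /etc/nvidia-container-runtime/config.toml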