NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Enabling gpu on microk8s, pod/nvidia-driver-daemonset restart many times at status CrashLoopBackOff #751

Closed haiph-dev closed 2 months ago

haiph-dev commented 3 months ago

1. Quick Debug Information

2. Issue or feature description

After enabling the GPU addon on microk8s, pod/nvidia-driver-daemonset restarts many times and stays in status CrashLoopBackOff.

3. Steps to reproduce the issue

I followed the instructions from https://www.nvidia.com/en-us/on-demand/session/gtcspring21-ss33138/ to install microk8s on Ubuntu 22.04. The instructions say not to install the NVIDIA driver on the host. I tried both with and without the host driver, and got the same result.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
dpkg -i cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
cp /var/cuda-repo-ubuntu2204-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt install nvidia-fabricmanager-555
snap install microk8s --classic --channel=1.30/stable
microk8s enable gpu

4. Information to attach (optional if deemed irrelevant)

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.129.03 for Linux kernel version 5.15.0-112-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-112-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-535.129.03/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  695 | struct nv_drm_plane_state *nv_drm_plane_state =
      | ^~
/usr/src/nvidia-535.129.03/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-535.129.03/kernel/nvidia-peermem/nvidia-peermem.c:490:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  490 | int status = 0;
      | ^~~
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[2]: [scripts/Makefile.modpost:133: /usr/src/nvidia-535.129.03/kernel/Module.symvers] Error 1
make[2]: Deleting file '/usr/src/nvidia-535.129.03/kernel/Module.symvers'
make[1]: [Makefile:1830: modules] Error 2
make: [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

 - [ ] Output from running `nvidia-smi` from the driver container: `microk8s kubectl exec nvidia-driver-daemonset-99xrx -n gpu-operator-resources -c nvidia-driver-ctr -- nvidia-smi`

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

command terminated with exit code 9


 - [ ] containerd logs `journalctl -u containerd > containerd.log`
 -- No entries --
chaunceyjiang commented 3 months ago
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
nvidia-driver-ctr    83 | }
nvidia-driver-ctr       | ^
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-ctr   695 |     struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-ctr       |     ^~~~~~
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-peermem/nvidia-peermem.c:462:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-ctr   462 |     int status = 0;
nvidia-driver-ctr       |     ^~~
nvidia-driver-ctr ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
nvidia-driver-ctr make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.104.12/kernel/Module.symvers] Error 1
nvidia-driver-ctr make[2]: *** Deleting file '/usr/src/nvidia-535.104.12/kernel/Module.symvers'
nvidia-driver-ctr make[1]: *** [Makefile:1830: modules] Error 2
nvidia-driver-ctr make: *** [Makefile:82: modules] Error 2
nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr Unmounting NVIDIA driver rootfs...

I also encountered the same problem.

haiph-dev commented 3 months ago

I found that a different NVIDIA driver version (nvidia-555) was installed on the host. I reinstalled nvidia-535 and it worked. Hope this helps.
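
For anyone checking for the same kind of mismatch, here is a minimal sketch that compares the driver version the driver container tries to build with what the host reports. The function name `container_driver_version` is made up for illustration, and the sample banner is copied from the installer log earlier in this thread; this is not part of the operator itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the driver version from the installer banner
# ("Starting installation of NVIDIA driver version X for Linux kernel version Y").
container_driver_version() {
  sed -n 's/.*Starting installation of NVIDIA driver version \([0-9][0-9.]*\) for Linux kernel version .*/\1/p' | head -n1
}

# Sample banner copied from the log posted in this issue.
banner='Starting installation of NVIDIA driver version 535.129.03 for Linux kernel version 5.15.0-112-generic'
wanted="$(printf '%s\n' "$banner" | container_driver_version)"
echo "driver container builds: ${wanted}"

# On the host, a loaded NVIDIA kernel module reports its version in this file.
if [ -r /proc/driver/nvidia/version ]; then
  grep '^NVRM' /proc/driver/nvidia/version
else
  echo "no NVIDIA kernel module loaded on the host"
fi
```

Against a live cluster, the banner could instead be piped in from the pod logs, for example `microk8s kubectl logs nvidia-driver-daemonset-99xrx -n gpu-operator-resources -c nvidia-driver-ctr | container_driver_version`, with the pod name replaced by your own.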

cdesiniotis commented 2 months ago

The error `ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'` is a known issue with newer kernels. This issue was fixed with driver versions >= 535.183.08. Closing this issue.
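
As a rough illustration of the version threshold mentioned above, the following sketch checks whether a given driver version string is at or above 535.183.08. The helper name `version_ge` is hypothetical, and the comparison relies on GNU `sort -V`; it only compares version strings and says nothing about driver branches other than the one named in this thread.

```shell
#!/usr/bin/env bash
# Minimum fixed version as quoted in the comment above.
MIN_FIXED="535.183.08"

# Hypothetical helper: succeeds when version $1 >= version $2.
# Uses version-aware sorting (GNU sort -V) and checks that $2 sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# The first two versions are the ones that failed to build in this thread.
for v in 535.104.12 535.129.03 535.183.08; do
  if version_ge "$v" "$MIN_FIXED"; then
    echo "$v: at or above $MIN_FIXED"
  else
    echo "$v: below $MIN_FIXED"
  fi
done
```

Both versions that hit the build error in the logs above fall below the threshold, which is consistent with the explanation given here.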

452256 commented 1 month ago

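A question for the author: I am in a similar situation, except that in my case many pods are stuck in Init, and the container responsible for installing the driver fails to install it.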