NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

GPU drivers not installing with host kernel 6.8 and vGPU 16.5 (535.161.05) #718

Closed urbaman closed 2 months ago

urbaman commented 4 months ago

1. Quick Debug Information

2. Issue or feature description

Driver installation fails inside a VM when the host runs kernel 6.8 with vGPU 16.5 (535.161.05).

3. Steps to reproduce the issue

Install vGPU 16.5 (535.161.05) on the host, then deploy gpu-operator in the VM.

4. Information to attach (optional if deemed irrelevant)

nvidia-driver-daemonset-k59mv logs:

Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-106-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  695 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |     ^~~~~~
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-peermem/nvidia-peermem.c:490:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  490 |     int status = 0;
      |     ^~~
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.129.03-grid/kernel/Module.symvers] Error 1
make[2]: *** Deleting file '/usr/src/nvidia-535.129.03-grid/kernel/Module.symvers'
make[1]: *** [Makefile:1830: modules] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
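
For anyone triaging similar driver-daemonset logs, the relevant details (target kernel, driver source version, and the GPL-only symbol failure) can be pulled out with a few regular expressions. This is only a minimal sketch: `summarize_driver_log` and the patterns are hypothetical helpers written against the log lines quoted above, and other driver container versions may word these lines differently.

```python
import re

# Patterns are assumptions based on the log lines quoted in this issue.
KERNEL_RE = re.compile(r"Proceeding with Linux kernel version (\S+)")
DRIVER_RE = re.compile(r"/usr/src/nvidia-([0-9.]+)(?:-[a-z]+)?/kernel")
MODPOST_RE = re.compile(
    r"ERROR: modpost: GPL-incompatible module (\S+) uses GPL-only symbol '([^']+)'"
)


def summarize_driver_log(log_text):
    """Return the kernel version, driver version, and GPL-only symbol failures."""
    kernel = KERNEL_RE.search(log_text)
    driver = DRIVER_RE.search(log_text)
    return {
        "kernel": kernel.group(1) if kernel else None,
        "driver": driver.group(1) if driver else None,
        # List of (module, symbol) pairs, empty when the build did not hit this error.
        "gpl_symbol_failures": MODPOST_RE.findall(log_text),
    }


if __name__ == "__main__":
    # Three lines copied from the log above.
    sample = "\n".join([
        "Proceeding with Linux kernel version 5.15.0-106-generic",
        "/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':",
        "ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'",
    ])
    print(summarize_driver_log(sample))
```

Run against the log above, this would also surface that the container compiled the 535.129.03-grid sources against the 5.15.0-106-generic kernel, which is worth comparing with the versions expected for the deployment.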

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

kollachaitanyakrishna commented 4 months ago

I am seeing a similar issue. Attaching the crash report.

Azure VM: Linux 5.15.0-1063-azure x86_64

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

nvidia-dkms-515.0.crash.txt

bqm1111 commented 4 months ago

I also encountered the same problem on Ubuntu 20.04 with nvidia-driver-535.171.04 and kernel 5.15.0-107-generic.

vicaya commented 4 months ago

Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.

bqm1111 commented 4 months ago

> Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.

How can I install nvidia-driver-550 on ubuntu 20.04?

Stephenfang51 commented 3 months ago

> > Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.
>
> How can I install nvidia-driver-550 on ubuntu 20.04?

Hi, did you solve your problem? I have the same issue :(

bqm1111 commented 3 months ago

Hi

You have to manually download the driver from this site.

2019211753 commented 3 months ago

> Hi
>
> You have to manually download the driver from this site.

Manual installation works for me!

cdesiniotis commented 2 months ago

The following error was fixed in the 535.183.08 driver:

ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'

Closing this issue.
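
For anyone checking whether a given 535-branch driver already includes this fix, the comparison can be sketched as below. This is a rough illustration, not an official compatibility check: `parse_version` and `has_modpost_fix` are hypothetical helpers, and the code assumes that 535-branch releases numbered at or above 535.183.08 keep the fix. For other branches the function returns `None`, since this thread only has a user report that 550.x works.

```python
# "535.183.08" is the fixed version named in the maintainer comment above.
FIXED_535 = (535, 183, 8)


def parse_version(version):
    """Turn a dotted driver version such as '535.161.05' into (535, 161, 5)."""
    return tuple(int(part) for part in version.split("."))


def has_modpost_fix(version):
    """Return True/False for 535-branch drivers, None for any other branch.

    Assumption: later 535-branch releases retain the fix.
    """
    parsed = parse_version(version)
    if parsed[0] == 535:
        return parsed >= FIXED_535
    return None  # this thread does not state a fixed version for other branches


if __name__ == "__main__":
    # Driver versions mentioned in this issue.
    for v in ("535.129.03", "535.161.05", "535.171.04", "535.183.08"):
        print(v, has_modpost_fix(v))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when components have different digit counts.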