canonical / microk8s-core-addons

Core MicroK8s addons
Apache License 2.0

GPU operator DKMS build failure on 22.04 #303

Open VariableDeclared opened 1 month ago

VariableDeclared commented 1 month ago

Summary

When deploying MicroK8s on a GPU-enabled AWS machine running Ubuntu 22.04, a DKMS compile error is thrown:

/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'uvm_va_block_check_logical_permissions':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10755:60: warning: implicit conversion from 'uvm_fault_type_t' to 'uvm_fault_access_type_t' [-Wenum-conversion]
10755 |     uvm_prot_t access_prot = uvm_fault_access_type_to_prot(access_type);
      |                                                            ^~~~~~~~~~~
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'block_cpu_fault_locked':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10890:53: warning: implicit conversion from 'uvm_fault_access_type_t' to 'uvm_fault_type_t' [-Wenum-conversion]
10890 |                                                     fault_access_type,
      |                                                     ^~~~~~~~~~~~~~~~~
make[2]: *** [/usr/src/linux-headers-6.8.0-1015-aws/Makefile:1925: /usr/src/nvidia-535.129.03/kernel] Error 2
make[1]: *** [Makefile:240: __sub-make] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
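For triage, the driver version and the kernel headers version involved in the failed build can be read off the paths in the log above. The helper below is only a sketch: the function name dkms_versions is hypothetical, and it assumes the log uses the same /usr/src/nvidia-<version> and /usr/src/linux-headers-<version> path layout as this report.

```shell
# Hypothetical helper (not part of MicroK8s or the GPU operator):
# read a DKMS build log on stdin and print the driver and kernel versions.
dkms_versions() {
  local log driver kernel
  log="$(cat)"
  # Driver version: taken from paths such as /usr/src/nvidia-535.129.03/...
  driver="$(printf '%s\n' "$log" | grep -oE '/usr/src/nvidia-[0-9.]+' \
    | head -n1 | sed 's#.*/nvidia-##')"
  # Kernel version: taken from paths such as /usr/src/linux-headers-6.8.0-1015-aws/...
  kernel="$(printf '%s\n' "$log" | grep -oE '/usr/src/linux-headers-[^/ ]+' \
    | head -n1 | sed 's#.*/linux-headers-##')"
  # Fall back to "unknown" when a path is not present in the log.
  printf 'driver=%s kernel=%s\n' "${driver:-unknown}" "${kernel:-unknown}"
}
```

It could be fed from the failing driver pod, for example microk8s kubectl logs -n gpu-operator-resources <driver-pod> | dkms_versions, where <driver-pod> is a placeholder for the actual pod name.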

This is likely due to the older operator deploying an older driver version (535.129.03 in the log above), which is missing the correct signatures for later kernels such as 6.8.0-1015-aws. Deploying with the latest operator succeeds: microk8s enable gpu --version 24.6.2

NAMESPACE               NAME                                              READY  STATUS     RESTARTS  AGE
gpu-operator-resources  gpu-operator-node-feature-discovery-worker-pntfz  1/1    Running    0         9m3s
gpu-operator-resources  gpu-operator-node-feature-discovery-worker-xcgxn  1/1    Running    0         9m3s
gpu-operator-resources  gpu-operator-node-feature-discovery-worker-xxdlt  1/1    Running    0         9m3s
gpu-operator-resources  nvidia-container-toolkit-daemonset-hv4hc          1/1    Running    0         8m38s
gpu-operator-resources  nvidia-cuda-validator-cpkb7                       0/1    Completed  0         3m54s
gpu-operator-resources  nvidia-dcgm-exporter-s762v                        1/1    Running    0         8m38s
gpu-operator-resources  nvidia-device-plugin-daemonset-lh97z              1/1    Running    0         8m38s
gpu-operator-resources  nvidia-driver-daemonset-t84r4                     1/1    Running    0         8m44s
gpu-operator-resources  nvidia-operator-validator-8cnnk                   1/1    Running    0         8m38s
ingress                 nginx-ingress-microk8s-controller-f5v8r           1/1    Running    0         85m
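To confirm a deployment like the one listed above without reading every row, the pod listing can be filtered for anything that is neither Running nor Completed. This is a minimal sketch: the function name unhealthy_pods is hypothetical, and it assumes header-less input whose columns are namespace, name, ready, and status, in that order.

```shell
# Hypothetical helper: read a header-less pod listing on stdin and print
# "namespace/name" for every pod whose STATUS column (field 4) is neither
# Running nor Completed. Exits non-zero if any such pod was found.
unhealthy_pods() {
  awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" {
         print $1 "/" $2
         bad = 1
       }
       END { exit bad + 0 }'
}
```

A possible invocation is microk8s kubectl get pods -A --no-headers | unhealthy_pods, which prints nothing and exits 0 when every pod is healthy.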

inspection-report-20241004_143256.tar.gz

Reproduction Steps

  1. Deploy a GPU-enabled machine: juju add-machine --constraints='instance-type=g4dn.xlarge root-disk=100G'
  2. Run microk8s enable gpu
  3. The daemonset crashes with a DKMS compile error
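The steps above can be sketched as a small script. The run and reproduce functions and the DRY_RUN variable are hypothetical names introduced here; the final pod listing command is an assumption about how to observe the crash. As in the report, installing MicroK8s on the new machine is not shown, and the microk8s commands are assumed to run on that machine.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of
# executing it, so the script can be reviewed safely before a real run.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

reproduce() {
  # Step 1: provision a GPU instance (constraints copied from the report).
  run juju add-machine --constraints='instance-type=g4dn.xlarge root-disk=100G'
  # Step 2: enable the GPU addon with its default operator version.
  run microk8s enable gpu
  # Step 3: list the operator pods to watch the driver daemonset fail.
  run microk8s kubectl get pods -n gpu-operator-resources
}
```

Calling reproduce with DRY_RUN=1 only echoes the three commands; setting DRY_RUN=0 executes them.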

Introspection Report

Can you suggest a fix?

Change the default operator version to 24.6.2:

https://github.com/canonical/microk8s-core-addons/blob/main/addons/nvidia/enable#L216

Are you interested in contributing with a fix?

VariableDeclared commented 1 month ago

Opened PR: https://github.com/canonical/microk8s-core-addons/pull/305