When deploying MicroK8s on an AWS machine running Ubuntu 22.04, a DKMS compile error is thrown:
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
83 | }
| ^
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'uvm_va_block_check_logical_permissions':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10755:60: warning: implicit conversion from 'uvm_fault_type_t' to 'uvm_fault_access_type_t' [-Wenum-conversion]
10755 | uvm_prot_t access_prot = uvm_fault_access_type_to_prot(access_type);
| ^~~~~~~~~~~
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'block_cpu_fault_locked':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10890:53: warning: implicit conversion from 'uvm_fault_access_type_t' to 'uvm_fault_type_t' [-Wenum-conversion]
10890 | fault_access_type,
| ^~~~~~~~~~~~~~~~~
make[2]: *** [/usr/src/linux-headers-6.8.0-1015-aws/Makefile:1925: /usr/src/nvidia-535.129.03/kernel] Error 2
make[1]: *** [Makefile:240: __sub-make] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
This is likely because the older operator deploys older driver versions that lack the correct signatures for later kernels. Deploying with a newer operator version succeeds:
microk8s enable gpu --version 24.6.2
Summary
Deploying MicroK8s on an AWS machine running Ubuntu 22.04 fails with the DKMS compile error shown in the log above, likely because the older operator deploys an older driver that lacks the correct signatures for later kernels. After enabling the add-on with microk8s enable gpu --version 24.6.2, the GPU operator pods come up successfully:
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-pntfz   1/1   Running     0   9m3s
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-xcgxn   1/1   Running     0   9m3s
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-xxdlt   1/1   Running     0   9m3s
gpu-operator-resources   nvidia-container-toolkit-daemonset-hv4hc           1/1   Running     0   8m38s
gpu-operator-resources   nvidia-cuda-validator-cpkb7                        0/1   Completed   0   3m54s
gpu-operator-resources   nvidia-dcgm-exporter-s762v                         1/1   Running     0   8m38s
gpu-operator-resources   nvidia-device-plugin-daemonset-lh97z               1/1   Running     0   8m38s
gpu-operator-resources   nvidia-driver-daemonset-t84r4                      1/1   Running     0   8m44s
gpu-operator-resources   nvidia-operator-validator-8cnnk                    1/1   Running     0   8m38s
ingress                  nginx-ingress-microk8s-controller-f5v8r            1/1   Running     0   85m
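To check a pod listing like the one above without reading it by eye, a small filter can flag anything that is not Running or Completed. This is only a sketch: the helper name unhealthy_pods is hypothetical, and it assumes the usual six-column layout of a pod listing across all namespaces (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/sh
# Hypothetical helper (not part of MicroK8s): reads a pod listing on stdin
# and prints "namespace/name status" for every pod whose STATUS column is
# neither Running nor Completed.
# Assumption: columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    awk '$1 != "NAMESPACE" && NF >= 4 && $4 != "Running" && $4 != "Completed" {
        print $1 "/" $2 " " $4
    }'
}

# Example usage on a node where MicroK8s is installed:
#   microk8s kubectl get pods -A | unhealthy_pods
```

For the listing in this report the filter would print nothing, since every pod is either Running or Completed. With the failing driver build, the nvidia-driver-daemonset pod would be expected to show up here instead.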
inspection-report-20241004_143256.tar.gz
Reproduction Steps
juju add-machine --constraints='instance-type=g4dn.xlarge root-disk=100G'
Introspection Report
Can you suggest a fix?
Change the default GPU operator version used by the nvidia add-on to 24.6.2:
https://github.com/canonical/microk8s-core-addons/blob/main/addons/nvidia/enable#L216
Are you interested in contributing with a fix?