Closed lolwww closed 21 hours ago
@lolwww Can you try the following?
kubectl set env -n gpu-operator-resources ds nvidia-operator-validator -c nvidia-operator-validator DISABLE_DEV_CHAR_SYMLINK_CREATION=true
@gustavosr98 That didn't help, Gustavo. I get the same result after executing it.
When checking the logs inside the VM, this line looks interesting to me:
[..] level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node\n\n
It looks like the operator is trying to create /dev/nvidiactl, but when I checked, that device node already existed on the machine.
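Since the error says "failed to determine major: invalid device node", it may be worth checking whether the existing node is a real character device and which major:minor numbers it carries. The following is a minimal sketch, not something taken from the operator; `describe_node` is a hypothetical helper name and it assumes a `stat` that supports `-c` (GNU coreutils or BusyBox).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a path is a character device
# and print its major:minor numbers in decimal.
describe_node() {
  local path="$1" major minor
  if [ ! -e "$path" ]; then
    echo "$path: missing"
    return 1
  fi
  if [ ! -c "$path" ]; then
    echo "$path: exists but is not a character device"
    return 1
  fi
  # stat prints the device major (%t) and minor (%T) in hex; convert to decimal.
  major=$(printf '%d' "0x$(stat -c '%t' "$path")")
  minor=$(printf '%d' "0x$(stat -c '%T' "$path")")
  echo "$path: character device $major:$minor"
}
```

On the affected host this could be run as `describe_node /dev/nvidiactl`, and the major number compared against the NVIDIA entries listed by `grep -i nvidia /proc/devices`.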
After trying many things, I got it working with a newer version of nvidia-gpu-operator, because the default one does not support H100 and nvidia-driver-550.54.14:
microk8s enable gpu --version v24.3.0
I also used a newer vectoradd test image:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # To request full physical GPUs
          # nvidia.com/mig-1g.5gb: 1 # To request MIG GPUs
kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements] Test PASSED
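For anyone repeating this check, the apply, wait, and log steps can be scripted. This is only a sketch under assumptions: the manifest is saved as `cuda-vectoradd.yaml` (a hypothetical file name), kubectl is reached through `microk8s kubectl`, and the function names are made up for illustration. `kubectl wait --for=jsonpath=...` needs kubectl 1.23 or newer.

```shell
#!/usr/bin/env bash
# Succeeds if the log text on stdin reports a passing vectorAdd run.
vectoradd_passed() {
  grep -q 'Test PASSED'
}

# Apply the manifest, wait for the pod to finish, then check its log.
# Defined only; call it explicitly on a host with microk8s installed.
run_vectoradd_check() {
  local manifest="${1:-cuda-vectoradd.yaml}"  # hypothetical file name
  microk8s kubectl apply -f "$manifest" &&
    microk8s kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      pod/cuda-vectoradd --timeout=180s &&
    microk8s kubectl logs pod/cuda-vectoradd | vectoradd_passed
}
```

Calling `run_vectoradd_check` returns a zero exit status only when the pod reaches the Succeeded phase and its log contains "Test PASSED".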
Summary
I am running MicroK8s v1.28.9 on an Azure VM with an H100 GPU, using NVIDIA gpu-operator v23.9.1. nvidia-smi shows that the GPU is OK.
microk8s enable gpu also completes without errors. I have also tried disabling and re-enabling the gpu addon and rebooting the host, with no change.
However, if I run a simple GPU test, it fails (see below).
What Should Happen Instead?
GPU should work as expected.
Reproduction Steps
Introspection Report
inspection-report-20240625_140304.tar.gz
Can you suggest a fix?
The error appears to be related to https://github.com/NVIDIA/gpu-operator/issues/430, which is still open and suggests a workaround that involves creating symlinks manually. However, I have not been able to figure out exactly which symlinks to create to make it work. Workaround suggestions are welcome, thank you.
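As background for the symlink question: on Linux, entries under /dev/char are named `<major>:<minor>` and point to the matching device node. The sketch below only prints the link that would correspond to a given node; it is an illustration of that naming convention, not a confirmed fix for this issue, and `dev_char_link` and `suggest_symlink` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the /dev/char link name that corresponds to a character device node.
dev_char_link() {
  local path="$1" major minor
  [ -c "$path" ] || { echo "$path is not a character device" >&2; return 1; }
  # stat reports the device major (%t) and minor (%T) in hex; convert to decimal.
  major=$(printf '%d' "0x$(stat -c '%t' "$path")")
  minor=$(printf '%d' "0x$(stat -c '%T' "$path")")
  echo "/dev/char/$major:$minor"
}

# Print, without running it, the ln command that would create that link.
suggest_symlink() {
  local path="$1" link
  link=$(dev_char_link "$path") || return 1
  echo "ln -s $path $link"
}
```

Running `suggest_symlink /dev/nvidiactl` on the affected host would show the candidate link for review before anything is created. If the NVIDIA Container Toolkit CLI is installed on the host, `nvidia-ctk system create-dev-char-symlinks --create-all` is the tool NVIDIA documents for generating these links, though whether it helps in this setup is untested here.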