Figured out the issue in my case - the managed instance I was using had an nvidia-container-runtime/cli version that was too old to support the 460.32.03 driver.
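For anyone hitting the same thing, this is roughly how I checked the versions and pulled newer packages onto the node - a minimal sketch assuming an Amazon Linux 2 node with NVIDIA's libnvidia-container yum repo already configured; exact package names may differ on other distros:

```bash
# Compare the driver version against the installed container runtime stack
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-container-cli --version
yum list installed | grep -i nvidia-container

# Update libnvidia-container and the container toolkit
# (package names assume NVIDIA's yum repo for AL2 / CentOS 7)
sudo yum update -y libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit nvidia-container-runtime

# Restart docker so the updated prestart hook is picked up
sudo systemctl restart docker
```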
I'm encountering this issue as well using EKS managed instances and the current AL2_x86_64_GPU
AMI. What ended up being your solution?
Did you update the nvidia-container-runtime manually on the managed node? If so, do you recall what that process was like?
I set up MIG on the node and can see that MIG is enabled and the compute slices have been created - I have 56 total slices using the 5gb profile. However, on the K8s side, when I deploy gpu-feature-discovery (gfd) and the nvidia-device-plugin (ndp), I see both containers failing with the same error:
Warning Failed 2m17s kubelet Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=63912 /var/lib/docker/overlay2/66fc86911e6b9061961ccc8657a41e2ace06afe2cd48fdb8b365528c6a242c95/merged] nvidia-container-cli: detection error: cuda error: invalid device ordinal: unknown
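For reference, this is roughly how I'm deploying the two components - a sketch based on the upstream helm charts with the `single` MIG strategy; the repo URLs and the `migStrategy` value come from the charts' READMEs and may differ by chart version:

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update

# nvidia-device-plugin with the "single" MIG strategy
# (every GPU is partitioned into identical 1g.5gb slices,
#  exposed to Kubernetes as nvidia.com/gpu)
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set migStrategy=single

# gpu-feature-discovery with the matching strategy
helm upgrade --install gfd nvgfd/gpu-feature-discovery \
  --namespace kube-system \
  --set migStrategy=single
```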
Directly on the node I have the following: