Open · BartoszZawadzki opened 1 year ago
Additional info: apart from changing the container runtime from docker to containerd, I have also tried different gpu-operator settings (values), with CDI enabled/disabled, RDMA enabled/disabled, and others, to no avail.
Did you ever figure this out @BartoszZawadzki? Dealing with the same issue on EKS and Ubuntu.
No, but since I'm using kops I've tried using this - https://kops.sigs.k8s.io/gpu/ and it worked out-of-the-box
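For anyone else on kops, a rough sketch of that route (the spec field names are taken from the kops GPU docs linked above, so double-check them against your kops version):

kops edit cluster --name "$CLUSTER_NAME"
# in the cluster spec, enable kops' built-in NVIDIA support for containerd, roughly:
#   spec:
#     containerd:
#       nvidiaGPU:
#         enabled: true
kops update cluster --name "$CLUSTER_NAME" --yes
kops rolling-update cluster --name "$CLUSTER_NAME" --yes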
I'm also hitting this problem. How can it be solved?
failed to get sandbox runtime: no runtime for "nvidia"
this is a very generic error that happens when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.
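For reference, a minimal way to pull those logs, assuming the chart's default gpu-operator namespace and the standard app labels (adjust names to your install):

kubectl get pods -n gpu-operator -o wide
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --all-containers --tail=100
kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset --all-containers --tail=100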
@shivamerla
No, Rocky Linux is not supported currently.
failed to get sandbox runtime: no runtime for "nvidia"
this is a very generic error that happens when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.
I have attached logs from all containers deployed via the gpu-operator helm chart in the initial issue.
We're running into the same problem: the pods gpu-feature-discovery, nvidia-operator-validator, nvidia-dcgm-exporter and nvidia-device-plugin-daemonset are all not starting because of Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
nvidia-gpu-operator-node-feature-discovery-worker log
nvidia-container-toolkit-daemonset log
EDIT: Our problem is this issue in containerd which makes it impossible to additively use the imports to configure containerd plugins. In our case we're configuring registry mirrors, which in turn completely overrides NVIDIA's runtime configuration. We're probably going to have to go the same route as NVIDIA, meaning we'd have to somehow parse the config.toml, add our config, and write it back.
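For anyone hitting the same merge problem, a quick sketch for checking whether the nvidia runtime entry actually survived in the effective containerd config (default paths assumed; nvidia-ctk ships with the container toolkit):

containerd config dump | grep -A 5 'runtimes.nvidia'    # effective, merged config
grep -n 'imports' /etc/containerd/config.toml           # are imports overriding plugin sections?
# if the nvidia runtime entry is missing, the toolkit CLI can re-add it in place:
nvidia-ctk runtime configure --runtime=containerd --config=/etc/containerd/config.toml
systemctl restart containerd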
Hi, I once encountered the same error; here is my experience for reference. A week ago, I installed the NVIDIA driver, toolkit and device plugin manually to test GPU workloads, running containerd as the runtime for kubelet on Ubuntu 22.04, and the CUDA tests worked. A few days ago I tried installing the gpu-operator; before that I uninstalled the driver, toolkit and device plugin and reverted the /etc/containerd/config.toml config, and then I got the same error as you. After reading many old issues about this error, I found a gpu-operator committer recommending the lsmod | grep nvidia command, which showed NVIDIA driver modules still in use by the Ubuntu kernel, meaning the uninstall was incomplete. I rebooted the host, after which lsmod | grep nvidia returned nothing. Glad to say, everything was then OK and all the NVIDIA pods became Running. Hope this is useful to you!
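A short sketch of the checks described above, in case it saves someone a search (plain host-level commands, run on the GPU node before reinstalling via the operator):

lsmod | grep nvidia                               # driver modules still loaded from the manual install?
nvidia-smi || echo "no host driver responding"    # is a host-installed driver still answering?
# after uninstalling and rebooting, this should come back empty:
lsmod | grep nvidia || echo "no nvidia modules loaded"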
This problem may be caused by a failure in symlink creation. I don't think it's the best way, but you can avoid this issue by disabling symlink creation.
First you have to check whether your problem comes from this situation:
kubectl logs -f nvidia-container-toolkit-daemonset-j8wcf -n gpu-operator-resources -c driver-validation
If so, you will see an error message like the one below:
time="2024-06-19T07:21:42Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
Now just follow that message.
To summarize:
kubectl edit clusterpolicies.nvidia.com
Find the validator: section and add the driver: part. The result is:
validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"
  image: gpu-operator-validator
  imagePullPolicy: IfNotPresent
  plugin:
    env:
    - name: WITH_WORKLOAD
      value: "false"
  repository: nvcr.io/nvidia/cloud-native
  version: v23.9.1
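If it helps, a rough way to confirm the change landed and to re-trigger validation. I'm assuming the default ClusterPolicy name cluster-policy and the gpu-operator namespace; adjust to gpu-operator-resources if that's what your install uses:

kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.validator.driver.env}'
kubectl -n gpu-operator delete pod -l app=nvidia-operator-validator
# alternative to disabling: create the /dev/char symlinks on the node with the toolkit CLI
# (verify the subcommand exists in your nvidia-ctk version before relying on it)
nvidia-ctk system create-dev-char-symlinks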
Hey guys I have the exact same error as mentioned by @ordinaryparksee
What's going on with these symlinks? I don't understand :/
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
[ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
[ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
I'm deploying gpu-operator from Helm chart using ArgoCD in my Kubernetes cluster (1.23.17), which is built using kops on AWS infrastructure (not EKS).
I've been struggling with this for a while now; I've used both docker and containerd as the container runtime in my Kubernetes cluster. I'm currently running containerd v1.6.21.
After deploying the gpu-operator this is what is happening in the gpu-operator namespace:

Getting into more details on the pods that are stuck in the init state:
kubectl -n gpu-operator describe po gpu-feature-discovery-jtgll
kubectl -n gpu-operator describe po nvidia-dcgm-exporter-bpvks
kubectl -n gpu-operator describe po nvidia-device-plugin-daemonset-fwwgr
kubectl -n gpu-operator describe po nvidia-operator-validator-qjgsb
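A quicker way to surface the same failure event across all of those stuck pods at once (same namespace assumed):

kubectl -n gpu-operator get events --sort-by=.lastTimestamp | grep -iE 'sandbox|runtime'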
And finally my ClusterPolicy:
2. Steps to reproduce the issue
Deploy gpu-operator using Helm chart (23.3.2)
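For completeness, a sketch of that step with plain Helm (the repo URL is NVIDIA's public chart repo; I'm assuming default values, whereas in my case the chart is actually applied through ArgoCD):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.3.2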
3. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods --all-namespaces
[ ] kubernetes daemonset status:
kubectl get ds --all-namespaces
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
[ ] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
[ ] Output of running a container on the GPU machine:
docker run -it alpine echo foo
[ ] Docker configuration file:
cat /etc/docker/daemon.json
[ ] Docker runtime configuration:
docker info | grep runtime
[ ] NVIDIA shared directory:
ls -la /run/nvidia
[ ] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
[ ] NVIDIA driver directory:
ls -la /run/nvidia/driver
[ ] kubelet logs
journalctl -u kubelet > kubelet.logs
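Since this cluster runs containerd rather than docker, the containerd-side equivalents of the docker checks above would roughly be (default paths assumed):

crictl info | grep -i -A 3 nvidia                 # does CRI report an nvidia runtime handler?
grep -A 5 'runtimes.nvidia' /etc/containerd/config.toml
journalctl -u containerd > containerd.logs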