Open yanis-incepto opened 3 weeks ago
@yanis-incepto the nvidia-container-toolkit-daemonset-vb6qn pod is stuck in the init state and has not yet configured the nvidia runtime in containerd. Could you provide the logs for the containers in this daemonset?
nvidia-container-toolkit is finally running after some time, but the other pods still show the error (and it never goes away; I tried leaving everything running for a few hours):
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-t4bv8 0/1 Init:0/1 0 10m
gpu-operator-d97f85598-j7qt4 1/1 Running 0 7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8 1/1 Running 0 6d21h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj 1/1 Running 0 6d21h
gpu-operator-node-feature-discovery-worker-fcwh7 1/1 Running 0 10m
nvidia-container-toolkit-daemonset-gn495 1/1 Running 0 10m
nvidia-dcgm-exporter-wnhss 0/1 Init:0/1 0 10m
nvidia-device-plugin-daemonset-dwwqr 0/1 Init:0/1 0 10m
nvidia-driver-daemonset-p47wp 1/1 Running 0 10m
nvidia-operator-validator-zk4mv 0/1 Init:0/4 0 10m
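To pick out the stuck pods from a listing like the one above, a small filter on the READY column can help. This is only a sketch: the function name not_ready_pods is made up for illustration, and it assumes the default table layout of kubectl get pods (header line first, READY in the second column).

```shell
# Print the names of pods whose READY column (ready/total) is not complete.
# Reads the table printed by `kubectl get pods` on stdin.
not_ready_pods() {
  # NR > 1 skips the header line; $2 is the READY column, e.g. "0/1".
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage (namespace taken from this thread):
# kubectl get pods -n gpu-operator | not_ready_pods
```

Each name it prints can then be passed to kubectl describe pod to see which init container is blocking.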
As for its logs, it looks like it is waiting for a signal:
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-gn495
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
time="2024-06-03T10:46:29Z" level=info msg="Parsing arguments"
time="2024-06-03T10:46:29Z" level=info msg="Starting nvidia-toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Verifying Flags"
time="2024-06-03T10:46:29Z" level=info msg=Initializing
time="2024-06-03T10:46:29Z" level=info msg="Installing toolkit"
time="2024-06-03T10:46:29Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2024-06-03T10:46:29Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "management.nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2024-06-03T10:46:29Z" level=info msg="Setting up runtime"
time="2024-06-03T10:46:29Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-06-03T10:46:29Z" level=info msg="Successfully parsed arguments"
time="2024-06-03T10:46:29Z" level=info msg="Starting 'setup' for containerd"
time="2024-06-03T10:46:29Z" level=info msg="Config file does not exist; using empty config"
time="2024-06-03T10:46:29Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2024-06-03T10:46:29Z" level=info msg="Sending SIGHUP signal to containerd"
time="2024-06-03T10:46:29Z" level=info msg="Successfully signaled containerd"
time="2024-06-03T10:46:29Z" level=info msg="Completed 'setup' for containerd"
time="2024-06-03T10:46:29Z" level=info msg="Waiting for signal"
Please restart the nvidia-operator-validator-zk4mv pod to start with. If this proceeds, then restart the other pods too.
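For reference, restarting a DaemonSet-managed pod usually means deleting it so the controller recreates it under a new name. A minimal sketch, assuming the gpu-operator namespace from the listing earlier in the thread; the restart_pod helper and the KUBECTL variable are made up here so the command can be dry-run with a stub:

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Delete a pod so that its DaemonSet recreates it.
# $1 = namespace, $2 = pod name
restart_pod() {
  "$KUBECTL" delete pod -n "$1" "$2"
}

# Usage:
# restart_pod gpu-operator nvidia-operator-validator-zk4mv
```

The recreated pod gets a different name suffix, so list the pods again before checking its status.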
I just recreated the pod; still the same issue.
Is there a compatibility table for gpu-operator? Maybe the latest version is not compatible with Kubernetes 1.24.14?
I had this issue when my /etc/containerd/config.toml was incorrect (it was missing runc from the default). This is what it looks like now on each node:
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = "/usr/bin/runc"
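A quick way to check a node for the condition described above (a runc entry missing next to the nvidia one) might look like the following. It is a plain-text match with grep, not a real TOML parse, and the function names has_runtime and check_runtimes are made up for illustration; the section names follow the config shown above.

```shell
# Succeeds if the config file has a runtime section for the given name.
# $1 = path to config.toml, $2 = runtime name (e.g. runc, nvidia)
has_runtime() {
  # Matches "...containerd.runtimes.<name>]" but not "...<name>.options]".
  grep -qF "containerd.runtimes.$2]" "$1"
}

# Report whether both the runc and nvidia runtime sections are present.
# $1 = path to config.toml
check_runtimes() {
  cfg="$1"
  for rt in runc nvidia; do
    if has_runtime "$cfg" "$rt"; then
      echo "$rt: present"
    else
      echo "$rt: MISSING"
    fi
  done
}

# Usage, on each node:
# check_runtimes /etc/containerd/config.toml
```

If containerd on a node reads its configuration from a different path, pass that path instead.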
Hello, thanks for your help, but unfortunately I just tried it and it didn't work.
1. Quick Debug Information
2. Issue or feature description
It looks like the runtime is not detected: it is reported as not found, even though it exists.
3. Steps to reproduce the issue
I installed the chart with helmfile
4. Information to attach (optional if deemed irrelevant)
kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
Output from running nvidia-smi from the driver container:
kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
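The commands listed in the template can be gathered into one helper so that all the output ends up in a single file to attach. This is a sketch only: collect_debug_info and the KUBECTL variable are made up for illustration, and the kubectl commands are the ones from the template above.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Run the template's diagnostic commands in order.
# $1 = operator namespace, $2 = pod to describe, $3 = driver pod name
collect_debug_info() {
  ns="$1"; pod="$2"; driver_pod="$3"
  echo "== pods =="
  "$KUBECTL" get pods -n "$ns"
  echo "== daemonsets =="
  "$KUBECTL" get ds -n "$ns"
  echo "== describe $pod =="
  "$KUBECTL" describe pod -n "$ns" "$pod"
  echo "== nvidia-smi =="
  "$KUBECTL" exec "$driver_pod" -n "$ns" -c nvidia-driver-ctr -- nvidia-smi
}

# Usage (pod names taken from the listing earlier in this thread):
# collect_debug_info gpu-operator nvidia-operator-validator-zk4mv \
#   nvidia-driver-daemonset-p47wp > gpu-operator-debug.txt
```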