Closed · JunyaTaniai closed this issue 1 year ago
Hi @JunyaTaniai, is the Network Operator installed and the MOFED driver deployed? The GPU driver container waits for the MOFED driver to be ready before proceeding. Setting driver.rdma.useHostMofed=false
indicates to the GPU Operator that the MOFED driver is being installed via a container (and not on the host) -- in this case we wait for the status file /run/mellanox/drivers/.driver-ready
to be created before proceeding.
From your logs it appears this file is not being created:
$ sudo cat /var/log/pods/gpu-operator_nvidia-driver-daemonset-k8tvc_d6d1d91a-5e0e-4701-a01d-9a0585c5232e/mofed-validation/0.log
2022-11-22T05:38:21.156056921+00:00 stdout F running command bash with args [-c stat /run/mellanox/drivers/.driver-ready]
2022-11-22T05:38:21.160991578+00:00 stderr F stat: cannot statx '/run/mellanox/drivers/.driver-ready': No such file or directory
2022-11-22T05:38:21.161307447+00:00 stdout F command failed, retrying after 5 seconds
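For reference, a quick way to confirm whether the MOFED container has come up and written its readiness file (a sketch using the default paths; adjust pod/namespace names to your cluster):

```
# The Network Operator's MOFED driver pod should be Running
kubectl -n network-operator get pods
# On the GPU node itself: the file the mofed-validation init container waits for
sudo ls -l /run/mellanox/drivers/.driver-ready
```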
Hi, @cdesiniotis Thanks for the answer. I have deployed the MOFED driver via the Network Operator. However, several pods are still stuck in Init.
$ k -n network-operator get po
NAME READY STATUS RESTARTS AGE
mofed-ubuntu20.04-ds-2qct7 1/1 Running 0 3h13m
network-operator-688dd4f45-g787x 1/1 Running 0 3h14m
network-operator-node-feature-discovery-master-648d59648-znkwr 1/1 Running 0 3h14m
network-operator-node-feature-discovery-worker-m7lb6 1/1 Running 0 3h14m
rdma-shared-dp-ds-mrg2s 1/1 Running 0 3h6m
$ k -n gpu-operator logs nvidia-driver-daemonset-xmf7r mofed-validation
jyunya-taniai@mito26:~$ sudo cat /var/log/pods/gpu-operator_nvidia-driver-daemonset-xmf7r_c12dbbf4-5fb7-4565-a04f-e6a3fdde8318/mofed-validation/0.log
running command bash with args [-c stat /run/mellanox/drivers/.driver-ready]
File: /run/mellanox/drivers/.driver-ready
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 100061h/1048673d Inode: 23335252 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-11-24 00:21:00.169920830 +0000
Modify: 2022-11-24 00:21:00.169920830 +0000
Change: 2022-11-24 00:21:00.169920830 +0000
Birth: 2022-11-24 00:21:00.169920830 +0000
$ k -n gpu-operator get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-7tx82 0/1 Init:0/1 0 17m
gpu-operator-5dc6b8989b-h5wpw 1/1 Running 0 17m
gpu-operator-node-feature-discovery-master-65c9bd48c4-rzpzn 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-42qkd 1/1 Running 0 17m
nvidia-container-toolkit-daemonset-jtppl 1/1 Running 0 17m
nvidia-dcgm-exporter-7vzwf 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-fdnfc 0/1 Init:0/1 0 17m
nvidia-driver-daemonset-xmf7r 2/2 Running 1 (16m ago) 17m
nvidia-operator-validator-kkvtx 0/1 Init:CrashLoopBackOff 8 (15s ago) 17m
Checking the logs of the pods stuck in Init, it appears the nvidia-container-toolkit validation is not succeeding. The toolkit-validation container log of the nvidia-operator-validator pod also reports that the nvidia-smi command cannot be executed.
$ k -n gpu-operator logs gpu-feature-discovery-7tx82 gpu-feature-discovery
Error from server (BadRequest): container "gpu-feature-discovery" in pod "gpu-feature-discovery-7tx82" is waiting to start: PodInitializing
$ k -n gpu-operator logs gpu-feature-discovery-7tx82 toolkit-validation
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
$ k -n gpu-operator logs nvidia-dcgm-exporter-7vzwf nvidia-dcgm-exporter
Error from server (BadRequest): container "nvidia-dcgm-exporter" in pod "nvidia-dcgm-exporter-7vzwf" is waiting to start: PodInitializing
$ k -n gpu-operator logs nvidia-dcgm-exporter-7vzwf toolkit-validation
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
$ k -n gpu-operator logs nvidia-device-plugin-daemonset-fdnfc nvidia-device-plugin
Error from server (BadRequest): container "nvidia-device-plugin" in pod "nvidia-device-plugin-daemonset-fdnfc" is waiting to start: PodInitializing
$ k -n gpu-operator logs nvidia-device-plugin-daemonset-fdnfc toolkit-validation
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
$ k -n gpu-operator logs nvidia-operator-validator-kkvtx toolkit-validation
toolkit is not ready
time="2022-11-24T02:29:09Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
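As a side note, the two validation steps can be approximated by hand on the GPU node (a sketch; the paths are the operator defaults seen in the logs above):

```
# Driver check: the driver-validation container runs nvidia-smi chrooted into the driver root
sudo chroot /run/nvidia/driver nvidia-smi
# Toolkit check: confirm the container-toolkit actually installed its binaries into its install dir
ls -l /usr/local/nvidia/toolkit/
```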
Checking the nvidia-container-toolkit logs, the driver validation succeeds, but the nvidia-container-toolkit-ctr container log shows that some library links cannot be resolved.
$ k -n gpu-operator logs nvidia-container-toolkit-daemonset-jtppl driver-validation
Thu Nov 24 02:13:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:65:00.0 Off | 0 |
| N/A 49C P0 38W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ k -n gpu-operator logs nvidia-container-toolkit-daemonset-jtppl nvidia-container-toolkit-ctr
time="2022-11-24T02:13:36Z" level=info msg="Starting nvidia-toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Parsing arguments"
time="2022-11-24T02:13:36Z" level=info msg="Verifying Flags"
time="2022-11-24T02:13:36Z" level=info msg=Initializing
time="2022-11-24T02:13:36Z" level=info msg="Installing toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-11-24T02:13:36Z" level=info msg="Successfully parsed arguments"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2022-11-24T02:13:36Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2022-11-24T02:13:36Z" level=info msg="Finding library libnvidia-ml.so (root=/run/nvidia/driver)"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so'"
time="2022-11-24T02:13:36Z" level=info msg="Skipping library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': error resolving link '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': lstat /run/nvidia/driver/usr/lib64/libnvidia-ml.so: no such file or directory"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so'"
time="2022-11-24T02:13:36Z" level=info msg="Resolved link: '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so' => '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.520.61.05'"
time="2022-11-24T02:13:36Z" level=info msg="Using library root /run/nvidia/driver/usr/lib/x86_64-linux-gnu"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable 'nvidia-container-runtime.experimental' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing 'nvidia-container-runtime.experimental' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable '/usr/bin/nvidia-container-toolkit' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/bin/nvidia-container-toolkit' to '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook' -> 'nvidia-container-toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2022-11-24T02:13:36Z" level=info msg="Setting up runtime"
time="2022-11-24T02:13:36Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-11-24T02:13:36Z" level=info msg="Successfully parsed arguments"
time="2022-11-24T02:13:36Z" level=info msg="Starting 'setup' for crio"
time="2022-11-24T02:13:36Z" level=info msg="Waiting for signal"
I checked the node and, sure enough, those files do not exist. The nvidia-smi command also fails to run because libnvidia-ml.so cannot be found (a quick way to locate the library is sketched after the output below).
$ ll /usr/lib64/libnvidia-container.so.1
ls: cannot access '/usr/lib64/libnvidia-container.so.1': No such file or directory
$ ll /usr/lib64/libnvidia-container-go.so.1
ls: cannot access '/usr/lib64/libnvidia-container-go.so.1': No such file or directory
$ ll /usr/lib64/
total 8
drwxr-xr-x  2 root root 4096 Oct 10 23:35 ./
drwxr-xr-x 14 root root 4096 Oct 10 23:36 ../
lrwxrwxrwx  1 root root   32 Apr  7  2022 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-2.31.so*
$ ll /run/nvidia/driver/usr/lib64/libnvidia-ml.so
ls: cannot access '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': No such file or directory
$ ll /run/nvidia/driver/usr/lib64/
total 12
drwxr-xr-x 2 root root 4096 Aug  1 13:25 ./
drwxr-xr-x 1 root root 4096 Aug  1 13:22 ../
lrwxrwxrwx 1 root root   32 Apr  7  2022 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-2.31.so*
$ ll /run/nvidia/driver/bin/ | grep nvidia
-rwxr-xr-x 1 root root   38025 Nov 24 02:12 nvidia-bug-report.sh
-rwxr-xr-x 1 root root   50024 Nov 24 02:12 nvidia-cuda-mps-control
-rwxr-xr-x 1 root root   14560 Nov 24 02:12 nvidia-cuda-mps-server
-rwxr-xr-x 1 root root  137904 Nov 24 02:12 nvidia-debugdump
-rwxr-xr-x 1 root root  355344 Nov 24 02:12 nvidia-installer
-rwxr-xr-x 1 root root 3892272 Nov 24 02:12 nvidia-ngx-updater
-rwxr-xr-x 1 root root  208336 Nov 24 02:12 nvidia-persistenced
-rwxr-xr-x 1 root root  585216 Nov 24 02:12 nvidia-powerd
-rwxr-xr-x 1 root root  302216 Nov 24 02:12 nvidia-settings
-rwxr-xr-x 1 root root     900 Nov 24 02:12 nvidia-sleep.sh
-rwxr-xr-x 1 root root  600760 Nov 24 02:12 nvidia-smi
lrwxrwxrwx 1 root root      16 Nov 24 02:12 nvidia-uninstall -> nvidia-installer
-rwxr-xr-x 1 root root  207424 Nov 24 02:12 nvidia-xconfig*
$ ll /run/nvidia/driver/bin/nvidia-smi
-rwxr-xr-x 1 root root 600760 Nov 24 02:12 /run/nvidia/driver/bin/nvidia-smi*
$ /run/nvidia/driver/bin/nvidia-smi -L
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.
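A quick way to see where the driver container actually placed libnvidia-ml.so, since the toolkit log above resolved it under the Debian-style path rather than /usr/lib64 (a sketch):

```
# Locate libnvidia-ml.so inside the driver container's root filesystem
sudo find /run/nvidia/driver -name 'libnvidia-ml.so*' 2>/dev/null
# Per the toolkit log, it should show up under /run/nvidia/driver/usr/lib/x86_64-linux-gnu/
```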
Is there a way to get nvidia-container-toolkit validation to succeed so that everything is RUNNING?
@JunyaTaniai I suspect CRI-O is not configured to look for OCI hooks in /run/containers/oci/hooks.d. If so, the NVIDIA Container Runtime Hook will not be used (we currently put our hook in this directory). Can you try this workaround? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-cri-o. After restarting CRI-O, restart the operator-validator pod to re-run the toolkit validation.
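The linked workaround amounts to pointing CRI-O at the hooks directory the operator uses and restarting; a minimal sketch (the drop-in file name below is illustrative, so check the linked docs and your CRI-O version):

```
# Add the operator's hook path to CRI-O's hook search directories
cat <<'EOF' | sudo tee /etc/crio/crio.conf.d/99-nvidia-hooks.conf
[crio.runtime]
hooks_dir = [
  "/run/containers/oci/hooks.d",
  "/usr/share/containers/oci/hooks.d",
]
EOF
sudo systemctl restart crio

# Then restart the validator pod so the toolkit validation is re-run
kubectl -n gpu-operator delete pod -l app=nvidia-operator-validator
```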
Hi, @cdesiniotis Thank you for your answer. I have been using cri-o 1.25 on Ubuntu 20.04; is this configuration supported? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-container-runtimes If it is not supported even by the latest GPU Operator, I will reinstall Kubernetes with containerd and apply the following configuration: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd
Hi, @cdesiniotis Since I was using cri-o on Ubuntu 20.04, I redid the cluster with containerd. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-container-runtimes
I also redeployed the GPU Operator, passing Helm the container-toolkit values based on the following documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd
$ helm install --wait gpu-operator \
> -n gpu-operator --create-namespace \
> nvidia/gpu-operator \
> --set driver.enabled=true \
> --set driver.version=520.61.05 \
> --set driver.rdma.enabled=true \
> --set migManager.enabled=true \
> --set mig.strategy=single \
> --set driver.rdma.useHostMofed=false \
> --set nfd.enabled=true \
> --set toolkit.enabled=true \
> --set toolkit.version="devel-ubuntu20.04" \
> --set toolkit.env[0].name=CONTAINERD_CONFIG \
> --set toolkit.env[0].value=/etc/containerd/config.toml \
> --set toolkit.env[1].name=CONTAINERD_SOCKET \
> --set toolkit.env[1].value=/run/containerd/containerd.sock \
> --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
> --set toolkit.env[2].value=nvidia \
> --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
> --set-string toolkit.env[3].value=true
NAME: gpu-operator
LAST DEPLOYED: Tue Nov 29 09:01:27 2022
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
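With these toolkit.env values the container-toolkit should register an nvidia runtime in containerd; a quick way to verify on the node (a sketch, since the exact TOML layout depends on the containerd version):

```
# The toolkit rewrites containerd's config and restarts it; look for the nvidia runtime entry,
# which should point BinaryName at /usr/local/nvidia/toolkit/nvidia-container-runtime
sudo grep -n -A 4 'runtimes.nvidia' /etc/containerd/config.toml
# The operator also creates a RuntimeClass for it
kubectl get runtimeclass nvidia
```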
However, the container-toolkit pod now restarts repeatedly.
$ k -n gpu-operator get po -w
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-km5mf 0/1 Init:0/1 0 24s
gpu-operator-5dc6b8989b-kfxkb 1/1 Running 0 42s
gpu-operator-node-feature-discovery-master-65c9bd48c4-ghcnx 1/1 Running 0 42s
gpu-operator-node-feature-discovery-worker-sdvp6 1/1 Running 0 42s
nvidia-container-toolkit-daemonset-478rz 0/1 Init:0/1 0 24s
nvidia-dcgm-exporter-kwpjv 0/1 Init:0/1 0 24s
nvidia-device-plugin-daemonset-7rppw 0/1 Init:0/1 0 24s
nvidia-driver-daemonset-54wbd 0/2 Running 0 24s
nvidia-mig-manager-zrmrn 0/1 Init:0/1 0 24s
nvidia-operator-validator-nbqrc 0/1 Init:0/4 0 19s
nvidia-driver-daemonset-54wbd 0/2 Running 0 91s
nvidia-driver-daemonset-54wbd 1/2 Running 0 91s
nvidia-driver-daemonset-54wbd 1/2 Running 0 91s
nvidia-driver-daemonset-54wbd 2/2 Running 0 91s
nvidia-container-toolkit-daemonset-478rz 0/1 PodInitializing 0 93s
nvidia-container-toolkit-daemonset-478rz 1/1 Running 0 100s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 0 106s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 0 106s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 0 107s
nvidia-container-toolkit-daemonset-478rz 0/1 Init:0/1 1 107s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 0 108s
nvidia-container-toolkit-daemonset-478rz 1/1 Running 1 (14s ago) 114s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 1 (20s ago) 2m
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 1 (21s ago) 2m1s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 1 (21s ago) 2m1s
nvidia-container-toolkit-daemonset-478rz 0/1 Init:0/1 2 2m1s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 1 2m2s
nvidia-container-toolkit-daemonset-478rz 0/1 CrashLoopBackOff 1 (9s ago) 2m3s
nvidia-container-toolkit-daemonset-478rz 1/1 Running 2 (26s ago) 2m20s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 2 (35s ago) 2m29s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 2 (36s ago) 2m30s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 2 (36s ago) 2m30s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 2 2m31s
nvidia-container-toolkit-daemonset-478rz 0/1 CrashLoopBackOff 2 (12s ago) 2m32s
nvidia-container-toolkit-daemonset-478rz 1/1 Running 3 (35s ago) 2m55s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 3 (41s ago) 3m1s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 3 (41s ago) 3m1s
nvidia-container-toolkit-daemonset-478rz 0/1 Completed 3 (41s ago) 3m1s
nvidia-container-toolkit-daemonset-478rz 0/1 Init:CrashLoopBackOff 3 (32s ago) 3m3s
Attached are the describe results and logs from container-toolkit.
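(A sketch of how such describe output and previous-container logs can be collected:)

```
kubectl -n gpu-operator describe pod nvidia-container-toolkit-daemonset-478rz
# --previous shows the log of the last crashed/completed container instance
kubectl -n gpu-operator logs nvidia-container-toolkit-daemonset-478rz -c nvidia-container-toolkit-ctr --previous
```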
It took a few hours, but after waiting, everything came up Running. Now that it's resolved, I'm closing the issue. Thanks for your help!
$ k -n gpu-operator get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-zlxd5 1/1 Running 0 14h
gpu-operator-5dc6b8989b-xm5x6 1/1 Running 0 14h
gpu-operator-node-feature-discovery-master-65c9bd48c4-wkjhm 1/1 Running 0 14h
gpu-operator-node-feature-discovery-worker-c2qkt 1/1 Running 0 14h
nvidia-container-toolkit-daemonset-zhfdq 1/1 Running 54 (10h ago) 14h
nvidia-cuda-validator-wzz95 0/1 Completed 0 10h
nvidia-dcgm-exporter-ww6l6 1/1 Running 0 14h
nvidia-device-plugin-daemonset-d4tqn 1/1 Running 0 14h
nvidia-device-plugin-validator-qdctq 0/1 Completed 0 10h
nvidia-driver-daemonset-79j6s 2/2 Running 0 14h
nvidia-mig-manager-9wj4v 1/1 Running 0 14h
nvidia-operator-validator-sr282 1/1 Running 0 14h
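A final sanity check that the GPUs are now advertised to the scheduler (a sketch):

```
# The device plugin should expose nvidia.com/gpu in the node's capacity/allocatable
kubectl describe node | grep -A 2 'nvidia.com/gpu'
```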
Hi all,
1. Quick Debug Checklist
Are i2c_core and ipmi_msghandler loaded on the nodes? No
[x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? Yes
describe clusterpolicies
``` $ k -n gpu-operator describe clusterpolicies.nvidia.com --all-namespaces Name: cluster-policy Namespace: Labels: app.kubernetes.io/component=gpu-operator app.kubernetes.io/managed-by=Helm Annotations: meta.helm.sh/release-name: gpu-operator meta.helm.sh/release-namespace: gpu-operator API Version: nvidia.com/v1 Kind: ClusterPolicy Metadata: Creation Timestamp: 2022-11-22T02:35:24Z Generation: 1 Managed Fields: API Version: nvidia.com/v1 Fields Type: FieldsV1 fieldsV1: f:metadata: f:annotations: .: f:meta.helm.sh/release-name: f:meta.helm.sh/release-namespace: f:labels: .: f:app.kubernetes.io/component: f:app.kubernetes.io/managed-by: f:spec: .: f:daemonsets: .: f:priorityClassName: f:tolerations: f:dcgm: .: f:enabled: f:hostPort: f:image: f:imagePullPolicy: f:repository: f:version: f:dcgmExporter: .: f:enabled: f:env: f:image: f:imagePullPolicy: f:repository: f:serviceMonitor: .: f:additionalLabels: f:enabled: f:honorLabels: f:interval: f:version: f:devicePlugin: .: f:config: .: f:default: f:name: f:enabled: f:env: f:image: f:imagePullPolicy: f:repository: f:version: f:driver: .: f:certConfig: .: f:name: f:enabled: f:image: f:imagePullPolicy: f:kernelModuleConfig: .: f:name: f:licensingConfig: .: f:configMapName: f:nlsEnabled: f:manager: .: f:env: f:image: f:imagePullPolicy: f:repository: f:version: f:rdma: .: f:enabled: f:useHostMofed: f:repoConfig: .: f:configMapName: f:repository: f:rollingUpdate: .: f:maxUnavailable: f:version: f:virtualTopology: .: f:config: f:gfd: .: f:enabled: f:env: f:image: f:imagePullPolicy: f:repository: f:version: f:mig: .: f:strategy: f:migManager: .: f:config: .: f:name: f:enabled: f:env: f:gpuClientsConfig: .: f:name: f:image: f:imagePullPolicy: f:repository: f:version: f:nodeStatusExporter: .: f:enabled: f:image: f:imagePullPolicy: f:repository: f:version: f:operator: .: f:defaultRuntime: f:initContainer: .: f:image: f:imagePullPolicy: f:repository: f:version: f:runtimeClass: f:psp: .: f:enabled: f:sandboxDevicePlugin: .: f:enabled: f:image: f:imagePullPolicy: f:repository: f:version: f:sandboxWorkloads: .: f:defaultWorkload: f:enabled: f:toolkit: .: f:enabled: f:image: f:imagePullPolicy: f:installDir: f:repository: f:version: f:validator: .: f:image: f:imagePullPolicy: f:plugin: .: f:env: f:repository: f:version: f:vfioManager: .: f:driverManager: .: f:env: f:image: f:imagePullPolicy: f:repository: f:version: f:enabled: f:image: f:imagePullPolicy: f:repository: f:version: f:vgpuDeviceManager: .: f:config: .: f:default: f:name: f:enabled: f:image: f:imagePullPolicy: f:repository: f:version: f:vgpuManager: .: f:driverManager: .: f:env: f:image: f:imagePullPolicy: f:repository: f:version: f:enabled: f:image: f:imagePullPolicy: Manager: helm Operation: Update Time: 2022-11-22T02:35:24Z API Version: nvidia.com/v1 Fields Type: FieldsV1 fieldsV1: f:status: .: f:namespace: f:state: Manager: gpu-operator Operation: Update Subresource: status Time: 2022-11-22T02:35:44Z Resource Version: 23128 UID: 5ff18092-9076-4801-802d-c088cf09cc66 Spec: Daemonsets: Priority Class Name: system-node-critical Tolerations: Effect: NoSchedule Key: nvidia.com/gpu Operator: Exists Dcgm: Enabled: false Host Port: 5555 Image: dcgm Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: 3.0.4-1-ubuntu20.04 Dcgm Exporter: Enabled: true Env: Name: DCGM_EXPORTER_LISTEN Value: :9400 Name: DCGM_EXPORTER_KUBERNETES Value: true Name: DCGM_EXPORTER_COLLECTORS Value: /etc/dcgm-exporter/dcp-metrics-included.csv Image: dcgm-exporter Image Pull Policy: IfNotPresent 
Repository: nvcr.io/nvidia/k8s Service Monitor: Additional Labels: Enabled: false Honor Labels: false Interval: 15s Version: 3.0.4-3.0.0-ubuntu20.04 Device Plugin: Config: Default: Name: Enabled: true Env: Name: PASS_DEVICE_SPECS Value: true Name: FAIL_ON_INIT_ERROR Value: true Name: DEVICE_LIST_STRATEGY Value: envvar Name: DEVICE_ID_STRATEGY Value: uuid Name: NVIDIA_VISIBLE_DEVICES Value: all Name: NVIDIA_DRIVER_CAPABILITIES Value: all Image: k8s-device-plugin Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v0.12.3-ubi8 Driver: Cert Config: Name: Enabled: true Image: driver Image Pull Policy: IfNotPresent Kernel Module Config: Name: Licensing Config: Config Map Name: Nls Enabled: false Manager: Env: Name: ENABLE_AUTO_DRAIN Value: true Name: DRAIN_USE_FORCE Value: false Name: DRAIN_POD_SELECTOR_LABEL Value: Name: DRAIN_TIMEOUT_SECONDS Value: 0s Name: DRAIN_DELETE_EMPTYDIR_DATA Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.4.2 Rdma: Enabled: true Use Host Mofed: false Repo Config: Config Map Name: Repository: nvcr.io/nvidia Rolling Update: Max Unavailable: 1 Version: 520.61.05 Virtual Topology: Config: Gfd: Enabled: true Env: Name: GFD_SLEEP_INTERVAL Value: 60s Name: GFD_FAIL_ON_INIT_ERROR Value: true Image: gpu-feature-discovery Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v0.6.2-ubi8 Mig: Strategy: single Mig Manager: Config: Name: Enabled: true Env: Name: WITH_REBOOT Value: false Gpu Clients Config: Name: Image: k8s-mig-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.5.0-ubuntu20.04 Node Status Exporter: Enabled: false Image: gpu-operator-validator Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v22.9.0 Operator: Default Runtime: docker Init Container: Image: cuda Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: 11.7.1-base-ubi8 Runtime Class: nvidia Psp: Enabled: false Sandbox Device Plugin: Enabled: true Image: kubevirt-gpu-device-plugin Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v1.2.1 Sandbox Workloads: Default Workload: container Enabled: false Toolkit: Enabled: true Image: container-toolkit Image Pull Policy: IfNotPresent Install Dir: /usr/local/nvidia Repository: nvcr.io/nvidia/k8s Version: v1.11.0-ubuntu20.04 Validator: Image: gpu-operator-validator Image Pull Policy: IfNotPresent Plugin: Env: Name: WITH_WORKLOAD Value: true Repository: nvcr.io/nvidia/cloud-native Version: v22.9.0 Vfio Manager: Driver Manager: Env: Name: ENABLE_AUTO_DRAIN Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.4.2 Enabled: true Image: cuda Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: 11.7.1-base-ubi8 Vgpu Device Manager: Config: Default: default Name: Enabled: true Image: vgpu-device-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.2.0 Vgpu Manager: Driver Manager: Env: Name: ENABLE_AUTO_DRAIN Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.4.2 Enabled: false Image: vgpu-manager Image Pull Policy: IfNotPresent Status: Namespace: gpu-operator State: notReady Events:1. Issue or feature description
After installing the GPU Operator with MIG and RDMA enabled, all pods except the gpu-operator and gpu-operator-node-feature-discovery pods are stuck in the initialization phase.
2. Steps to reproduce the issue
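(Roughly, the deployment corresponds to a Helm install with RDMA and MIG enabled, inferred from the ClusterPolicy above; a sketch, not the exact command used:)

```
helm install --wait gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.rdma.enabled=true \
  --set driver.rdma.useHostMofed=false \
  --set migManager.enabled=true \
  --set mig.strategy=single
```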
3. Information
[x] kubernetes pods status:
kubectl get pods --all-namespaces
pods status
``` $ kubectl get pods --all-namespaces NAMESPACE NAME READY STATUS RESTARTS AGE calico-apiserver calico-apiserver-884778456-ddcsc 1/1 Running 0 168m calico-apiserver calico-apiserver-884778456-gttx9 1/1 Running 0 168m calico-system calico-kube-controllers-6b57db7fd6-hz8j4 1/1 Running 0 169m calico-system calico-node-bflqc 1/1 Running 0 169m calico-system calico-typha-59885bb5d7-hx2rz 1/1 Running 0 169m gpu-operator gpu-feature-discovery-rwkvl 0/1 Init:0/1 0 37m gpu-operator gpu-operator-5dc6b8989b-p6mbc 1/1 Running 0 37m gpu-operator gpu-operator-node-feature-discovery-master-65c9bd48c4-hqf5q 1/1 Running 0 37m gpu-operator gpu-operator-node-feature-discovery-worker-5p2h5 1/1 Running 0 37m gpu-operator nvidia-container-toolkit-daemonset-k8dl6 0/1 Init:0/1 0 37m gpu-operator nvidia-dcgm-exporter-bhjnb 0/1 Init:0/1 0 37m gpu-operator nvidia-device-plugin-daemonset-jzlbm 0/1 Init:0/1 0 37m gpu-operator nvidia-driver-daemonset-k8tvc 0/2 Init:0/2 0 37m gpu-operator nvidia-operator-validator-hlfp7 0/1 Init:0/4 0 37m kube-system coredns-565d847f94-9dq8f 1/1 Running 0 174m kube-system coredns-565d847f94-s9jd5 1/1 Running 0 174m kube-system etcd-mito26.server.org 1/1 Running 2 175m kube-system kube-apiserver-mito26.server.org 1/1 Running 2 175m kube-system kube-controller-manager-mito26.server.org 1/1 Running 0 174m kube-system kube-proxy-wpf76 1/1 Running 0 174m kube-system kube-scheduler-mito26.server.org 1/1 Running 2 174m tigera-operator tigera-operator-6bb5985474-srnn8 1/1 Running 0 169m ```[x] kubernetes daemonset status:
kubectl get ds --all-namespaces
get ds
```
$ kubectl get ds --all-namespaces
NAMESPACE       NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
calico-system   calico-node                                  1         1         1       1            1           kubernetes.io/os=linux                             170m
calico-system   csi-node-driver                              0         0         0       0            0           kubernetes.io/os=linux                             170m
gpu-operator    gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   37m
gpu-operator    gpu-operator-node-feature-discovery-worker   1         1         1       1            1
```
[x] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
describe pod gpu-feature-discovery
``` $ k -n gpu-operator describe po gpu-feature-discovery-rwkvl Name: gpu-feature-discovery-rwkvl Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Service Account: nvidia-gpu-feature-discovery Node: mito26.server.org/192.168.200.96 Start Time: Tue, 22 Nov 2022 02:35:44 +0000 Labels: app=gpu-feature-discovery app.kubernetes.io/part-of=nvidia-gpu controller-revision-hash=5d85dbd666 pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: e6ccd423694c07c80a917c696727b0a4febdfb9d8e1d12efde6d1999aa1a1641 cni.projectcalico.org/podIP: 10.244.11.124/32 cni.projectcalico.org/podIPs: 10.244.11.124/32 Status: Pending IP: 10.244.11.124 IPs: IP: 10.244.11.124 Controlled By: DaemonSet/gpu-feature-discovery Init Containers: toolkit-validation: Container ID: cri-o://3f4d69d9b0914823a7e79a5378f60979579cc971410328c4e05c1c92611fcd20 Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6 Port:describe pod nvidia-container-toolkit-daemonset
``` $ k -n gpu-operator describe po nvidia-container-toolkit-daemonset-k8dl6 Name: nvidia-container-toolkit-daemonset-k8dl6 Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Service Account: nvidia-container-toolkit Node: mito26.server.org/192.168.200.96 Start Time: Tue, 22 Nov 2022 02:35:43 +0000 Labels: app=nvidia-container-toolkit-daemonset controller-revision-hash=85d9894fb6 pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: d13b07524f1df468caf094c0fccfaa4f0cbcedb8129f8fdb880389b16b125be5 cni.projectcalico.org/podIP: 10.244.11.120/32 cni.projectcalico.org/podIPs: 10.244.11.120/32 Status: Pending IP: 10.244.11.120 IPs: IP: 10.244.11.120 Controlled By: DaemonSet/nvidia-container-toolkit-daemonset Init Containers: driver-validation: Container ID: cri-o://6b6f1f6b7c1b90e571d5fa57464dcb1f7202a6f9796c743e3f152aeb0bad0072 Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6 Port:describe pod nvidia-dcgm-exporter-
``` $ k -n gpu-operator describe po nvidia-dcgm-exporter-bhjnb Name: nvidia-dcgm-exporter-bhjnb Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Service Account: nvidia-dcgm-exporter Node: mito26.server.org/192.168.200.96 Start Time: Tue, 22 Nov 2022 02:35:44 +0000 Labels: app=nvidia-dcgm-exporter controller-revision-hash=558d5c6485 pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: c88c059e59af4f85e13e504e4a65a8ef5e0ec0cdd16b512cb0afbc04e58abfcc cni.projectcalico.org/podIP: 10.244.11.125/32 cni.projectcalico.org/podIPs: 10.244.11.125/32 Status: Pending IP: 10.244.11.125 IPs: IP: 10.244.11.125 Controlled By: DaemonSet/nvidia-dcgm-exporter Init Containers: toolkit-validation: Container ID: cri-o://523f91431044841a1706de65676b1efa761a53822b9689578687b52a9c82b7ec Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6 Port:describe pod nvidia-device-plugin-daemonset
``` $ k -n gpu-operator describe po nvidia-device-plugin-daemonset-jzlbm Name: nvidia-device-plugin-daemonset-jzlbm Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Service Account: nvidia-device-plugin Node: mito26.server.org/192.168.200.96 Start Time: Tue, 22 Nov 2022 02:35:43 +0000 Labels: app=nvidia-device-plugin-daemonset controller-revision-hash=759699b885 pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: 15e9af34d96f8f5a1c27dc25d32ba7f7358ccd2e93a317ea23a98139e3a26c0a cni.projectcalico.org/podIP: 10.244.11.121/32 cni.projectcalico.org/podIPs: 10.244.11.121/32 Status: Pending IP: 10.244.11.121 IPs: IP: 10.244.11.121 Controlled By: DaemonSet/nvidia-device-plugin-daemonset Init Containers: toolkit-validation: Container ID: cri-o://53cff9e385493146546f6b4fb12b10037e7b1a99aa9124f56cb769be30e9cb6f Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6 Port:describe pod nvidia-driver-daemonset
``` $ k -n gpu-operator describe po nvidia-driver-daemonset-k8tvc Name: nvidia-driver-daemonset-k8tvc Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Service Account: nvidia-driver Node: mito26.server.org/192.168.200.96 Start Time: Tue, 22 Nov 2022 02:35:43 +0000 Labels: app=nvidia-driver-daemonset controller-revision-hash=6775f65988 pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: 3f65dab3a8ed4b1e90955c04c9f9e8b61b9cc15f763d95e5cf06a8aa58974e2b cni.projectcalico.org/podIP: 10.244.11.123/32 cni.projectcalico.org/podIPs: 10.244.11.123/32 Status: Pending IP: 10.244.11.123 IPs: IP: 10.244.11.123 Controlled By: DaemonSet/nvidia-driver-daemonset Init Containers: mofed-validation: Container ID: cri-o://3bcf5bf91d19a174f28467e05cf7d0171b140d2b5261fbf37f9b93b9b065ce67 Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6 Port:describe pod nvidia-operator-validator
``` $ k -n gpu-operator describe po nvidia-operator-validator-hlfp7 Name: nvidia-operator-validator-hlfp7 Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Service Account: nvidia-operator-validator Node: mito26.server.org/192.168.200.96 Start Time: Tue, 22 Nov 2022 02:35:43 +0000 Labels: app=nvidia-operator-validator app.kubernetes.io/part-of=gpu-operator controller-revision-hash=df5cbdc4f pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: f85abe423ababe00c088804732b87bdaac351df67a81d90f6c169a8ea71c56af cni.projectcalico.org/podIP: 10.244.11.122/32 cni.projectcalico.org/podIPs: 10.244.11.122/32 Status: Pending IP: 10.244.11.122 IPs: IP: 10.244.11.122 Controlled By: DaemonSet/nvidia-operator-validator Init Containers: driver-validation: Container ID: cri-o://78c087e010432e9a3bb21d98a701bb46d14ec2a10348cca52729547f58f7a88f Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6 Port:[x] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
Since kubectl logs produces no useful output, we include the log files from /var/log/pods instead.
gpu-feature-discovery log
``` $ sudo cat /var/log/pods/gpu-operator_gpu-feature-discovery-rwkvl_82dbe642-8f8b-4634-812e-171ee06cd5c1/toolkit-validation/0.log 2022-11-22T04:05:53.914790700+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:05:58.917849636+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:06:03.921302039+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:06:08.924586551+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:06:13.927938990+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:06:18.931333373+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:06:23.934661339+00:00 stdout F waiting for nvidia container stack to be setup ```nvidia-container-toolkit-daemonsets log
``` $ sudo cat /var/log/pods/gpu-operator_nvidia-container-toolkit-daemonset-k8dl6_3c2a649b-4eb5-4b15-b60b-7114f77692cf/driver-validation/0.log 2022-11-22T04:09:04.909735641+00:00 stderr F chroot: failed to run command 'nvidia-smi': No such file or directory 2022-11-22T04:09:04.910083989+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T04:09:09.910755502+00:00 stdout F running command chroot with args [/run/nvidia/driver nvidia-smi] 2022-11-22T04:09:09.913204824+00:00 stderr F chroot: failed to run command 'nvidia-smi': No such file or directory 2022-11-22T04:09:09.913523235+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T04:09:14.914704866+00:00 stdout F running command chroot with args [/run/nvidia/driver nvidia-smi] 2022-11-22T04:09:14.917195040+00:00 stderr F chroot: failed to run command 'nvidia-smi': No such file or directory 2022-11-22T04:09:14.917522610+00:00 stdout F command failed, retrying after 5 seconds ```nvidia-dcgm-exporter log
``` $ sudo cat /var/log/pods/gpu-operator_nvidia-dcgm-exporter-bhjnb_f76a839f-4670-4244-b145-a2bb6d506b86/toolkit-validation/0.log 2022-11-22T04:17:04.572623880+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:17:09.575852179+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:17:14.579083583+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:17:19.582244567+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:17:24.585574474+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:17:29.588779832+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T04:17:34.591940411+00:00 stdout F waiting for nvidia container stack to be setup ```nvidia-device-plugin-daemonset log
``` $ sudo cat /var/log/pods/gpu-operator_nvidia-device-plugin-daemonset-jzlbm_3fb1cb66-758c-4187-92cf-f8c956276ed1/toolkit-validation/0.log 2022-11-22T05:36:37.303169869+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:36:42.306521315+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:36:47.309980509+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:36:52.313370794+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:36:57.316784848+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:37:02.320326747+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:37:07.323688230+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:37:12.327225309+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:37:17.330619005+00:00 stdout F waiting for nvidia container stack to be setup 2022-11-22T05:37:22.334102431+00:00 stdout F waiting for nvidia container stack to be setu ```nvidia-driver-daemonset log
``` $ sudo cat /var/log/pods/gpu-operator_nvidia-driver-daemonset-k8tvc_d6d1d91a-5e0e-4701-a01d-9a0585c5232e/mofed-validation/0.log 2022-11-22T05:38:21.156056921+00:00 stdout F running command bash with args [-c stat /run/mellanox/drivers/.driver-ready] 2022-11-22T05:38:21.160991578+00:00 stderr F stat: cannot statx '/run/mellanox/drivers/.driver-ready': No such file or directory 2022-11-22T05:38:21.161307447+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T05:38:26.162320590+00:00 stdout F running command bash with args [-c stat /run/mellanox/drivers/.driver-ready] 2022-11-22T05:38:26.167358516+00:00 stderr F stat: cannot statx '/run/mellanox/drivers/.driver-ready': No such file or directory 2022-11-22T05:38:26.167683096+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T05:38:31.168620993+00:00 stdout F running command bash with args [-c stat /run/mellanox/drivers/.driver-ready] 2022-11-22T05:38:31.173547738+00:00 stderr F stat: cannot statx '/run/mellanox/drivers/.driver-ready': No such file or directory 2022-11-22T05:38:31.173882297+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T05:38:36.174883940+00:00 stdout F running command bash with args [-c stat /run/mellanox/drivers/.driver-ready] 2022-11-22T05:38:36.179943712+00:00 stderr F stat: cannot statx '/run/mellanox/drivers/.driver-ready': No such file or directory 2022-11-22T05:38:36.180201659+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T05:38:41.180712282+00:00 stdout F running command bash with args [-c stat /run/mellanox/drivers/.driver-ready] 2022-11-22T05:38:41.185811180+00:00 stderr F stat: cannot statx '/run/mellanox/drivers/.driver-ready': No such file or directory 2022-11-22T05:38:41.186148142+00:00 stdout F command failed, retrying after 5 seconds ```nvidia-operator-validator log
``` $ sudo cat /var/log/pods/gpu-operator_nvidia-operator-validator-hlfp7_eb50eb18-4f2b-402d-9308-8de3322a05ab/driver-validation/0.log 2022-11-22T05:39:40.043664142+00:00 stdout F running command chroot with args [/run/nvidia/driver nvidia-smi] 2022-11-22T05:39:40.046023497+00:00 stderr F chroot: failed to run command 'nvidia-smi': No such file or directory 2022-11-22T05:39:40.046306559+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T05:39:45.047051403+00:00 stdout F running command chroot with args [/run/nvidia/driver nvidia-smi] 2022-11-22T05:39:45.049490493+00:00 stderr F chroot: failed to run command 'nvidia-smi': No such file or directory 2022-11-22T05:39:45.049808369+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T05:39:50.050370297+00:00 stdout F running command chroot with args [/run/nvidia/driver nvidia-smi] 2022-11-22T05:39:50.052832256+00:00 stderr F chroot: failed to run command 'nvidia-smi': No such file or directory 2022-11-22T05:39:50.053162408+00:00 stdout F command failed, retrying after 5 seconds 2022-11-22T05:39:55.053958031+00:00 stdout F running command chroot with args [/run/nvidia/driver nvidia-smi] 2022-11-22T05:39:55.056469130+00:00 stderr F chroot: failed to run command 'nvidia-smi': No such file or directory 2022-11-22T05:39:55.056816581+00:00 stdout F command failed, retrying after 5 seconds ```[ ] Output of running a container on the GPU machine:
docker run -it alpine echo foo
[ ] Docker configuration file:
cat /etc/docker/daemon.json
[ ] Docker runtime configuration:
docker info | grep runtime
[x] NVIDIA shared directory:
ls -la /run/nvidia
[x] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
[x] NVIDIA driver directory:
ls -la /run/nvidia/driver
[ ] kubelet logs
journalctl -u kubelet > kubelet.logs