NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

gfd, container-toolkit, dcgm-exporter, device-plugin, driver, operator-validator stop at init and do not start. #448

Closed JunyaTaniai closed 1 year ago

JunyaTaniai commented 2 years ago

Hi all,

1. Quick Debug Checklist

1. Issue or feature description

After installing the GPU Operator with MIG and RDMA enabled, all pods except the gpu-operator pod and the gpu-operator-node-feature-discovery pods are stuck at the init stage and never start.

2. Steps to reproduce the issue

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update
$ helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set driver.version=520.61.05 \
--set driver.rdma.enabled=true \
--set toolkit.enabled=true \
--set migManager.enabled=true \
--set mig.strategy=single \
--set driver.rdma.useHostMofed=false \
--set nfd.enabled=true

3. Information

cdesiniotis commented 2 years ago

Hi @JunyaTaniai, is the Network Operator installed and the MOFED driver deployed? The GPU driver container waits for the MOFED driver to be ready before proceeding. Setting driver.rdma.useHostMofed=false tells the GPU Operator that the MOFED driver is installed via a container (and not on the host) -- in that case we wait for the status file /run/mellanox/drivers/.driver-ready to be created before proceeding.

From your logs it appears this file is not being created:

$ sudo cat /var/log/pods/gpu-operator_nvidia-driver-daemonset-k8tvc_d6d1d91a-5e0e-4701-a01d-9a0585c5232e/mofed-validation/0.log
2022-11-22T05:38:21.156056921+00:00 stdout F running command bash with args [-c stat /run/mellanox/drivers/.driver-ready]
2022-11-22T05:38:21.160991578+00:00 stderr F stat: cannot statx '/run/mellanox/drivers/.driver-ready': No such file or directory
2022-11-22T05:38:21.161307447+00:00 stdout F command failed, retrying after 5 seconds
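
A quick way to double-check both sides (the network-operator namespace here is an assumption based on a default Network Operator install):

# Is a MOFED driver pod running and Ready?
$ kubectl -n network-operator get pods

# On the GPU node itself: has the MOFED container written its readiness flag yet?
$ stat /run/mellanox/drivers/.driver-ready
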
JunyaTaniai commented 2 years ago

Hi @cdesiniotis, thanks for the answer. I have deployed the MOFED driver via the Network Operator. However, several resources are still stuck in Init.

$ k -n network-operator get po
NAME                                                             READY   STATUS    RESTARTS   AGE
mofed-ubuntu20.04-ds-2qct7                                       1/1     Running   0          3h13m
network-operator-688dd4f45-g787x                                 1/1     Running   0          3h14m
network-operator-node-feature-discovery-master-648d59648-znkwr   1/1     Running   0          3h14m
network-operator-node-feature-discovery-worker-m7lb6             1/1     Running   0          3h14m
rdma-shared-dp-ds-mrg2s                                          1/1     Running   0          3h6m

$ k -n gpu-operator logs nvidia-driver-daemonset-xmf7r mofed-validation
jyunya-taniai@mito26:~$ sudo cat /var/log/pods/gpu-operator_nvidia-driver-daemonset-xmf7r_c12dbbf4-5fb7-4565-a04f-e6a3fdde8318/mofed-validation/0.log 
running command bash with args [-c stat /run/mellanox/drivers/.driver-ready]
  File: /run/mellanox/drivers/.driver-ready
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: 100061h/1048673d   Inode: 23335252    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-11-24 00:21:00.169920830 +0000
Modify: 2022-11-24 00:21:00.169920830 +0000
Change: 2022-11-24 00:21:00.169920830 +0000
 Birth: 2022-11-24 00:21:00.169920830 +0000

$ k -n gpu-operator get po
NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-feature-discovery-7tx82                                   0/1     Init:0/1                0             17m
gpu-operator-5dc6b8989b-h5wpw                                 1/1     Running                 0             17m
gpu-operator-node-feature-discovery-master-65c9bd48c4-rzpzn   1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-42qkd              1/1     Running                 0             17m
nvidia-container-toolkit-daemonset-jtppl                      1/1     Running                 0             17m
nvidia-dcgm-exporter-7vzwf                                    0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-fdnfc                          0/1     Init:0/1                0             17m
nvidia-driver-daemonset-xmf7r                                 2/2     Running                 1 (16m ago)   17m
nvidia-operator-validator-kkvtx                               0/1     Init:CrashLoopBackOff   8 (15s ago)   17m

Checking the logs of the pods stuck in Init, it appears that the nvidia-container-toolkit validation is not succeeding. The toolkit-validation container log of the nvidia-operator-validator pod also shows that the nvidia-smi executable cannot be found.

$ k -n gpu-operator logs gpu-feature-discovery-7tx82 gpu-feature-discovery 
Error from server (BadRequest): container "gpu-feature-discovery" in pod "gpu-feature-discovery-7tx82" is waiting to start: PodInitializing

$ k -n gpu-operator logs gpu-feature-discovery-7tx82 toolkit-validation
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup

$ k -n gpu-operator logs nvidia-dcgm-exporter-7vzwf nvidia-dcgm-exporter 
Error from server (BadRequest): container "nvidia-dcgm-exporter" in pod "nvidia-dcgm-exporter-7vzwf" is waiting to start: PodInitializing

$ k -n gpu-operator logs nvidia-dcgm-exporter-7vzwf toolkit-validation 
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup

$ k -n gpu-operator logs nvidia-device-plugin-daemonset-fdnfc nvidia-device-plugin 
Error from server (BadRequest): container "nvidia-device-plugin" in pod "nvidia-device-plugin-daemonset-fdnfc" is waiting to start: PodInitializing

$ k -n gpu-operator logs nvidia-device-plugin-daemonset-fdnfc toolkit-validation 
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup

$ k -n gpu-operator logs nvidia-operator-validator-kkvtx toolkit-validation 
toolkit is not ready
time="2022-11-24T02:29:09Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"

Checking the nvidia-container-toolkit logs, the driver validation succeeds, but the nvidia-container-toolkit-ctr container log shows that some library links under /usr/lib64 cannot be resolved.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ k -n gpu-operator logs nvidia-container-toolkit-daemonset-jtppl nvidia-container-toolkit-ctr
time="2022-11-24T02:13:36Z" level=info msg="Starting nvidia-toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Parsing arguments"
time="2022-11-24T02:13:36Z" level=info msg="Verifying Flags"
time="2022-11-24T02:13:36Z" level=info msg=Initializing
time="2022-11-24T02:13:36Z" level=info msg="Installing toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-11-24T02:13:36Z" level=info msg="Successfully parsed arguments"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2022-11-24T02:13:36Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'"
time="2022-11-24T02:13:36Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.9.0'"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2022-11-24T02:13:36Z" level=info msg="Finding library libnvidia-ml.so (root=/run/nvidia/driver)"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so'"
time="2022-11-24T02:13:36Z" level=info msg="Skipping library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': error resolving link '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': lstat /run/nvidia/driver/usr/lib64/libnvidia-ml.so: no such file or directory"
time="2022-11-24T02:13:36Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so'"
time="2022-11-24T02:13:36Z" level=info msg="Resolved link: '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so' => '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.520.61.05'"
time="2022-11-24T02:13:36Z" level=info msg="Using library root /run/nvidia/driver/usr/lib/x86_64-linux-gnu"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable 'nvidia-container-runtime.experimental' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing 'nvidia-container-runtime.experimental' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Installing executable '/usr/bin/nvidia-container-toolkit' to /usr/local/nvidia/toolkit"
time="2022-11-24T02:13:36Z" level=info msg="Installing '/usr/bin/nvidia-container-toolkit' to '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2022-11-24T02:13:36Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook' -> 'nvidia-container-toolkit'"
time="2022-11-24T02:13:36Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2022-11-24T02:13:36Z" level=info msg="Setting up runtime"
time="2022-11-24T02:13:36Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-11-24T02:13:36Z" level=info msg="Successfully parsed arguments"
time="2022-11-24T02:13:36Z" level=info msg="Starting 'setup' for crio"
time="2022-11-24T02:13:36Z" level=info msg="Waiting for signal"

I checked each of those files on the node and, sure enough, they do not exist. The nvidia-smi command also fails to run because it cannot find libnvidia-ml.so.

$ ll /usr/lib64/libnvidia-container.so.1
ls: cannot access '/usr/lib64/libnvidia-container.so.1': No such file or directory

$ ll /usr/lib64/libnvidia-container-go.so.1
ls: cannot access '/usr/lib64/libnvidia-container-go.so.1': No such file or directory

$ ll /usr/lib64/
total 8
drwxr-xr-x  2 root root 4096 Oct 10 23:35 ./
drwxr-xr-x 14 root root 4096 Oct 10 23:36 ../
lrwxrwxrwx  1 root root   32 Apr  7  2022 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-2.31.so*

$ ll /run/nvidia/driver/usr/lib64/libnvidia-ml.so
ls: cannot access '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': No such file or directory

$ ll /run/nvidia/driver/usr/lib64/
total 12
drwxr-xr-x 2 root root 4096 Aug  1 13:25 ./
drwxr-xr-x 1 root root 4096 Aug  1 13:22 ../
lrwxrwxrwx 1 root root   32 Apr  7  2022 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-2.31.so*

$ ll /run/nvidia/driver/bin/ | grep nvidia
-rwxr-xr-x 1 root root   38025 Nov 24 02:12 nvidia-bug-report.sh
-rwxr-xr-x 1 root root   50024 Nov 24 02:12 nvidia-cuda-mps-control
-rwxr-xr-x 1 root root   14560 Nov 24 02:12 nvidia-cuda-mps-server
-rwxr-xr-x 1 root root  137904 Nov 24 02:12 nvidia-debugdump
-rwxr-xr-x 1 root root  355344 Nov 24 02:12 nvidia-installer
-rwxr-xr-x 1 root root 3892272 Nov 24 02:12 nvidia-ngx-updater
-rwxr-xr-x 1 root root  208336 Nov 24 02:12 nvidia-persistenced
-rwxr-xr-x 1 root root  585216 Nov 24 02:12 nvidia-powerd
-rwxr-xr-x 1 root root  302216 Nov 24 02:12 nvidia-settings
-rwxr-xr-x 1 root root     900 Nov 24 02:12 nvidia-sleep.sh
-rwxr-xr-x 1 root root  600760 Nov 24 02:12 nvidia-smi
lrwxrwxrwx 1 root root      16 Nov 24 02:12 nvidia-uninstall -> nvidia-installer
-rwxr-xr-x 1 root root  207424 Nov 24 02:12 nvidia-xconfig*

$ ll /run/nvidia/driver/bin/nvidia-smi
-rwxr-xr-x 1 root root 600760 Nov 24 02:12 /run/nvidia/driver/bin/nvidia-smi*

$ /run/nvidia/driver/bin/nvidia-smi -L
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.
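
As a diagnostic aside (paths taken from the toolkit-ctr log above, which resolves libnvidia-ml.so under /run/nvidia/driver/usr/lib/x86_64-linux-gnu rather than /usr/lib64), one way to check the containerized driver directly is:

$ ls -l /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
$ LD_LIBRARY_PATH=/run/nvidia/driver/usr/lib/x86_64-linux-gnu /run/nvidia/driver/bin/nvidia-smi -L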



Is there a way to get the nvidia-container-toolkit validation to succeed so that all the pods reach Running?

cdesiniotis commented 1 year ago

@JunyaTaniai I suspect CRIO is not configured to look for OCI hooks in /run/containers/oci/hooks.d. If so, the NVIDIA Container Runtime Hook will not be used (we currently put our hook in this directory). Can you try this workaround? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-cri-o. After restarting CRIO, restart the operator-validator pod to restart the toolkit validation.
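
A quick way to check this on the node (assuming the standard CRI-O config location under /etc/crio):

# Does the CRI-O config list the hooks directory the toolkit writes into?
$ grep -rn hooks_dir /etc/crio/

# Is the NVIDIA hook definition actually present there?
$ ls -l /run/containers/oci/hooks.d/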

JunyaTaniai commented 1 year ago

Hi @cdesiniotis, thank you for your answer. I have been using CRI-O 1.25 on Ubuntu 20.04; is this configuration supported? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-container-runtimes If even the latest version of the GPU Operator does not support it, I will reinstall Kubernetes with containerd and apply the following configuration: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd

JunyaTaniai commented 1 year ago

Hi @cdesiniotis, I redid the setup with containerd, since I had been using CRI-O on Ubuntu 20.04 (see the supported container runtimes: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-container-runtimes).

I also redeployed the GPU Operator, passing helm the container-toolkit values based on the following documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd

$ helm install --wait gpu-operator \
> -n gpu-operator --create-namespace \
> nvidia/gpu-operator \
> --set driver.enabled=true \
> --set driver.version=520.61.05 \
> --set driver.rdma.enabled=true \
> --set migManager.enabled=true \
> --set mig.strategy=single \
> --set driver.rdma.useHostMofed=false \
> --set nfd.enabled=true \
> --set toolkit.enabled=true \
> --set toolkit.version="devel-ubuntu20.04" \
> --set toolkit.env[0].name=CONTAINERD_CONFIG \
> --set toolkit.env[0].value=/etc/containerd/config.toml \
> --set toolkit.env[1].name=CONTAINERD_SOCKET \
> --set toolkit.env[1].value=/run/containerd/containerd.sock \
> --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
> --set toolkit.env[2].value=nvidia \
> --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
> --set-string toolkit.env[3].value=true
NAME: gpu-operator
LAST DEPLOYED: Tue Nov 29 09:01:27 2022
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
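
As a quick sanity check (the file path follows the CONTAINERD_CONFIG value passed above), the nvidia runtime entries should show up in the containerd config on the node once the toolkit pod has run:

$ grep -n nvidia /etc/containerd/config.toml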

However, the container-toolkit pod now restarts repeatedly.

$ k -n gpu-operator get po -w
NAME                                                          READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-km5mf                                   0/1     Init:0/1   0          24s
gpu-operator-5dc6b8989b-kfxkb                                 1/1     Running    0          42s
gpu-operator-node-feature-discovery-master-65c9bd48c4-ghcnx   1/1     Running    0          42s
gpu-operator-node-feature-discovery-worker-sdvp6              1/1     Running    0          42s
nvidia-container-toolkit-daemonset-478rz                      0/1     Init:0/1   0          24s
nvidia-dcgm-exporter-kwpjv                                    0/1     Init:0/1   0          24s
nvidia-device-plugin-daemonset-7rppw                          0/1     Init:0/1   0          24s
nvidia-driver-daemonset-54wbd                                 0/2     Running    0          24s
nvidia-mig-manager-zrmrn                                      0/1     Init:0/1   0          24s
nvidia-operator-validator-nbqrc                               0/1     Init:0/4   0          19s
nvidia-driver-daemonset-54wbd                                 0/2     Running    0          91s
nvidia-driver-daemonset-54wbd                                 1/2     Running    0          91s
nvidia-driver-daemonset-54wbd                                 1/2     Running    0          91s
nvidia-driver-daemonset-54wbd                                 2/2     Running    0          91s
nvidia-container-toolkit-daemonset-478rz                      0/1     PodInitializing   0          93s
nvidia-container-toolkit-daemonset-478rz                      1/1     Running           0          100s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         0          106s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         0          106s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         0          107s
nvidia-container-toolkit-daemonset-478rz                      0/1     Init:0/1          1          107s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         0          108s
nvidia-container-toolkit-daemonset-478rz                      1/1     Running           1 (14s ago)   114s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         1 (20s ago)   2m
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         1 (21s ago)   2m1s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         1 (21s ago)   2m1s
nvidia-container-toolkit-daemonset-478rz                      0/1     Init:0/1          2             2m1s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed         1             2m2s
nvidia-container-toolkit-daemonset-478rz                      0/1     CrashLoopBackOff   1 (9s ago)    2m3s
nvidia-container-toolkit-daemonset-478rz                      1/1     Running            2 (26s ago)   2m20s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed          2 (35s ago)   2m29s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed          2 (36s ago)   2m30s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed          2 (36s ago)   2m30s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed          2             2m31s
nvidia-container-toolkit-daemonset-478rz                      0/1     CrashLoopBackOff   2 (12s ago)   2m32s
nvidia-container-toolkit-daemonset-478rz                      1/1     Running            3 (35s ago)   2m55s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed          3 (41s ago)   3m1s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed          3 (41s ago)   3m1s
nvidia-container-toolkit-daemonset-478rz                      0/1     Completed          3 (41s ago)   3m1s
nvidia-container-toolkit-daemonset-478rz                      0/1     Init:CrashLoopBackOff   3 (32s ago)   3m3s

Attached are the describe results and logs from container-toolkit.

describe nvidia-container-toolkit ``` $ k -n gpu-operator describe po nvidia-container-toolkit-daemonset-478rz Name: nvidia-container-toolkit-daemonset-478rz Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Service Account: nvidia-container-toolkit Node: mito26/192.168.200.96 Start Time: Wed, 30 Nov 2022 04:05:03 +0000 Labels: app=nvidia-container-toolkit-daemonset controller-revision-hash=7667984ccd pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: aabc9bd9e2fd0f9b78d17fa823968e519e539a9b3ec36c32baa5cf421af00408 cni.projectcalico.org/podIP: 10.244.53.144/32 cni.projectcalico.org/podIPs: 10.244.53.144/32 Status: Running IP: 10.244.53.144 IPs: IP: 10.244.53.144 Controlled By: DaemonSet/nvidia-container-toolkit-daemonset Init Containers: driver-validation: Container ID: containerd://4c9d3d6f6704b0130feac5af1d7f2c6a40531cf56634c5ccd8aaf4f8bb4f6a14 Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6 Port: Host Port: Command: sh -c Args: nvidia-validator State: Terminated Reason: Completed Exit Code: 0 Started: Wed, 30 Nov 2022 04:13:29 +0000 Finished: Wed, 30 Nov 2022 04:13:29 +0000 Ready: True Restart Count: 3 Environment: WITH_WAIT: true COMPONENT: driver Mounts: /host from host-root (ro) /run/nvidia/driver from driver-install-path (rw) /run/nvidia/validations from run-nvidia-validations (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hggj (ro) Containers: nvidia-container-toolkit-ctr: Container ID: containerd://5f74e224965b1fe192cb969f9e8f7256f349f424e5d6d1b470587c71c9f5867b Image: nvcr.io/nvidia/k8s/container-toolkit:devel-ubuntu20.04 Image ID: nvcr.io/nvidia/k8s/container-toolkit@sha256:ab294deee14f471f5e859e9f107c413e34f0f2e0d9dc6f2bfd3f0584ed88a5e8 Port: Host Port: Command: bash -c Args: [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-toolkit /usr/local/nvidia State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Completed Exit Code: 0 Started: Wed, 30 Nov 2022 04:13:15 +0000 Finished: Wed, 30 Nov 2022 04:13:21 +0000 Ready: False Restart Count: 6 Environment: RUNTIME_ARGS: --socket /runtime/sock-dir/containerd.sock --config /runtime/config-dir/config.toml CONTAINERD_CONFIG: /etc/containerd/config.toml CONTAINERD_SOCKET: /run/containerd/containerd.sock CONTAINERD_RUNTIME_CLASS: nvidia CONTAINERD_SET_AS_DEFAULT: true RUNTIME: containerd Mounts: /host from host-root (ro) /run/nvidia from nvidia-run-path (rw) /runtime/config-dir/ from containerd-config (rw) /runtime/sock-dir/ from containerd-socket (rw) /usr/local/nvidia from toolkit-install-dir (rw) /usr/share/containers/oci/hooks.d from crio-hooks (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hggj (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: nvidia-run-path: Type: HostPath (bare host directory volume) Path: /run/nvidia HostPathType: DirectoryOrCreate run-nvidia-validations: Type: HostPath (bare host directory volume) Path: /run/nvidia/validations HostPathType: DirectoryOrCreate driver-install-path: Type: HostPath (bare host directory volume) Path: /run/nvidia/driver HostPathType: host-root: Type: HostPath (bare host directory volume) Path: / HostPathType: toolkit-install-dir: Type: HostPath (bare 
host directory volume) Path: /usr/local/nvidia HostPathType: crio-hooks: Type: HostPath (bare host directory volume) Path: /run/containers/oci/hooks.d HostPathType: containerd-config: Type: HostPath (bare host directory volume) Path: /etc/containerd HostPathType: containerd-socket: Type: HostPath (bare host directory volume) Path: /run/containerd HostPathType: kube-api-access-8hggj: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: DownwardAPI: true QoS Class: BestEffort Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists node.kubernetes.io/pid-pressure:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists node.kubernetes.io/unschedulable:NoSchedule op=Exists nvidia.com/gpu:NoSchedule op=Exists Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 11m default-scheduler Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-478rz to mito26 Normal Started 10m (x2 over 10m) kubelet Started container nvidia-container-toolkit-ctr Normal Killing 10m (x2 over 10m) kubelet Stopping container nvidia-container-toolkit-ctr Normal SandboxChanged 9m54s (x4 over 10m) kubelet Pod sandbox changed, it will be killed and re-created. Normal Pulled 9m53s (x3 over 11m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine Normal Created 9m53s (x3 over 11m) kubelet Created container driver-validation Normal Started 9m53s (x3 over 11m) kubelet Started container driver-validation Normal Pulled 9m40s (x3 over 10m) kubelet Container image "nvcr.io/nvidia/k8s/container-toolkit:devel-ubuntu20.04" already present on machine Normal Created 9m40s (x3 over 10m) kubelet Created container nvidia-container-toolkit-ctr Warning BackOff 101s (x38 over 9m52s) kubelet Back-off restarting failed container ```
nvidia-container-toolkit log ``` $ k -n gpu-operator logs nvidia-container-toolkit-daemonset-478rz driver-validation running command chroot with args [/run/nvidia/driver nvidia-smi] Wed Nov 30 04:13:29 2022 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 NVIDIA A100-PCI... On | 00000000:65:00.0 Off | 0 | | N/A 40C P0 34W / 250W | 0MiB / 40960MiB | 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ $ k -n gpu-operator logs nvidia-container-toolkit-daemonset-478rz nvidia-container-toolkit-ctr time="2022-11-30T04:13:15Z" level=info msg="Starting nvidia-toolkit" time="2022-11-30T04:13:15Z" level=info msg="Parsing arguments" time="2022-11-30T04:13:15Z" level=info msg="Verifying Flags" time="2022-11-30T04:13:15Z" level=info msg=Initializing time="2022-11-30T04:13:15Z" level=info msg="Installing toolkit" time="2022-11-30T04:13:15Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]" time="2022-11-30T04:13:15Z" level=info msg="Successfully parsed arguments" time="2022-11-30T04:13:15Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'" time="2022-11-30T04:13:15Z" level=info msg="Removing existing NVIDIA container toolkit installation" time="2022-11-30T04:13:15Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'" time="2022-11-30T04:13:15Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'" time="2022-11-30T04:13:15Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'" time="2022-11-30T04:13:15Z" level=info msg="Finding library libnvidia-container.so.1 (root=)" time="2022-11-30T04:13:15Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'" time="2022-11-30T04:13:15Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory" time="2022-11-30T04:13:15Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'" time="2022-11-30T04:13:15Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 
'libnvidia-container.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)" time="2022-11-30T04:13:15Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'" time="2022-11-30T04:13:15Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory" time="2022-11-30T04:13:15Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'" time="2022-11-30T04:13:15Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.9.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.9.0'" time="2022-11-30T04:13:15Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit" time="2022-11-30T04:13:15Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'" time="2022-11-30T04:13:15Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'" time="2022-11-30T04:13:15Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'" time="2022-11-30T04:13:15Z" level=info msg="Finding library libnvidia-ml.so (root=/run/nvidia/driver)" time="2022-11-30T04:13:15Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so'" time="2022-11-30T04:13:15Z" level=info msg="Skipping library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': error resolving link '/run/nvidia/driver/usr/lib64/libnvidia-ml.so': lstat /run/nvidia/driver/usr/lib64/libnvidia-ml.so: no such file or directory" time="2022-11-30T04:13:15Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so'" time="2022-11-30T04:13:15Z" level=info msg="Resolved link: '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so' => '/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.520.61.05'" time="2022-11-30T04:13:15Z" level=info msg="Using library root /run/nvidia/driver/usr/lib/x86_64-linux-gnu" time="2022-11-30T04:13:15Z" level=info msg="Installing executable 'nvidia-container-runtime.experimental' to /usr/local/nvidia/toolkit" time="2022-11-30T04:13:15Z" level=info msg="Installing 'nvidia-container-runtime.experimental' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'" time="2022-11-30T04:13:15Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'" time="2022-11-30T04:13:15Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental'" time="2022-11-30T04:13:15Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'" time="2022-11-30T04:13:15Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit" 
time="2022-11-30T04:13:15Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'" time="2022-11-30T04:13:15Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'" time="2022-11-30T04:13:15Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'" time="2022-11-30T04:13:15Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-toolkit'" time="2022-11-30T04:13:15Z" level=info msg="Installing executable '/usr/bin/nvidia-container-toolkit' to /usr/local/nvidia/toolkit" time="2022-11-30T04:13:15Z" level=info msg="Installing '/usr/bin/nvidia-container-toolkit' to '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'" time="2022-11-30T04:13:15Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'" time="2022-11-30T04:13:15Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-toolkit'" time="2022-11-30T04:13:15Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook' -> 'nvidia-container-toolkit'" time="2022-11-30T04:13:15Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'" time="2022-11-30T04:13:15Z" level=info msg="Setting up runtime" time="2022-11-30T04:13:15Z" level=info msg="Starting 'setup' for containerd" time="2022-11-30T04:13:15Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]" time="2022-11-30T04:13:15Z" level=info msg="Successfully parsed arguments" time="2022-11-30T04:13:15Z" level=info msg="Loading config: /runtime/config-dir/config.toml" time="2022-11-30T04:13:15Z" level=info msg="Successfully loaded config" time="2022-11-30T04:13:15Z" level=info msg="Config version: 2" time="2022-11-30T04:13:15Z" level=info msg="Updating config" time="2022-11-30T04:13:15Z" level=info msg="Successfully updated config" time="2022-11-30T04:13:15Z" level=info msg="Flushing config" time="2022-11-30T04:13:15Z" level=info msg="Successfully flushed config" time="2022-11-30T04:13:15Z" level=info msg="Sending SIGHUP signal to containerd" time="2022-11-30T04:13:15Z" level=info msg="Successfully signaled containerd" time="2022-11-30T04:13:15Z" level=info msg="Waiting for signal" time="2022-11-30T04:13:21Z" level=info msg="Cleaning up Runtime" time="2022-11-30T04:13:21Z" level=info msg="Starting 'cleanup' for containerd" time="2022-11-30T04:13:21Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]" time="2022-11-30T04:13:21Z" level=info msg="Successfully parsed arguments" time="2022-11-30T04:13:21Z" level=info msg="Loading config: /runtime/config-dir/config.toml" time="2022-11-30T04:13:21Z" level=info msg="Successfully loaded config" time="2022-11-30T04:13:21Z" level=info msg="Config version: 2" time="2022-11-30T04:13:21Z" level=info msg="Reverting config" time="2022-11-30T04:13:21Z" level=info msg="Successfully reverted config" time="2022-11-30T04:13:21Z" level=info msg="Flushing config" time="2022-11-30T04:13:21Z" level=info msg="Successfully flushed config" time="2022-11-30T04:13:21Z" level=info msg="Sending SIGHUP signal to containerd" ```
JunyaTaniai commented 1 year ago

It took a few hours, but after waiting everything came up and is now Running. Since it is resolved, I am closing the issue. Thanks for your help!

$ k -n gpu-operator get po
NAME                                                          READY   STATUS      RESTARTS       AGE
gpu-feature-discovery-zlxd5                                   1/1     Running     0              14h
gpu-operator-5dc6b8989b-xm5x6                                 1/1     Running     0              14h
gpu-operator-node-feature-discovery-master-65c9bd48c4-wkjhm   1/1     Running     0              14h
gpu-operator-node-feature-discovery-worker-c2qkt              1/1     Running     0              14h
nvidia-container-toolkit-daemonset-zhfdq                      1/1     Running     54 (10h ago)   14h
nvidia-cuda-validator-wzz95                                   0/1     Completed   0              10h
nvidia-dcgm-exporter-ww6l6                                    1/1     Running     0              14h
nvidia-device-plugin-daemonset-d4tqn                          1/1     Running     0              14h
nvidia-device-plugin-validator-qdctq                          0/1     Completed   0              10h
nvidia-driver-daemonset-79j6s                                 2/2     Running     0              14h
nvidia-mig-manager-9wj4v                                      1/1     Running     0              14h
nvidia-operator-validator-sr282                               1/1     Running     0              14h
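
As a final sanity check (assuming mig.strategy=single, under which MIG devices are advertised as plain nvidia.com/gpu resources; the node name below is a placeholder):

$ kubectl describe node <gpu-node> | grep -A 5 "nvidia.com/gpu"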