NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Pods take 25-30 minutes to terminate #303

Open dbugit opened 2 years ago

dbugit commented 2 years ago

1. Issue or feature description

The gpu-operator deploys and runs in our test cluster just fine, and the canned examples return the expected results. However, when uninstalling the operator, all of its related Pods remain in a Terminating state for 25-30 minutes before actually terminating, during which time containerd is inaccessible. Is this normal?

2. Steps to reproduce the issue

Given an override.yaml file like this:

toolkit:
  version: 1.7.2-ubi8

dcgm:
  version: 2.3.1-ubi8

dcgmExporter:
  version: 2.3.1-2.6.0-ubi8

migManager:
  version: v0.2.0-ubi8

Deploy via Helm: helm install gpu-test nvidia/gpu-operator --version 1.9.0 -n gpu-test (note that the gpu-test namespace is created beforehand).

After verifying that the validators finish and all other Pods are in a Running state, I let the cluster sit for about 10 minutes and then remove the operator with the command helm uninstall gpu-test -n gpu-test. I verify the Terminating state with repeated calls to kubectl get pods -n gpu-test, sometimes via watch if I'm really feeling lazy.
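Condensed, the reproduce steps above look like this (the `-f override.yaml` flag is my assumption about how the overrides were applied; it isn't stated verbatim in the original command):

```shell
# Reproduce sketch -- assumes override.yaml from above is in the current directory.
kubectl create namespace gpu-test
helm install gpu-test nvidia/gpu-operator --version 1.9.0 -n gpu-test -f override.yaml

# Wait ~10 minutes for the validators to finish and all Pods to reach Running.
kubectl get pods -n gpu-test

helm uninstall gpu-test -n gpu-test
watch kubectl get pods -n gpu-test   # Pods sit in Terminating for 25-30 minutes
```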

3. Information to attach (optional if deemed irrelevant)

While the cluster is running and before uninstalling the gpu-operator, I observe the following. Note that these logs were recorded during different runs at different times, so not everything is from the same test or in chronological order. Note also that node007 is the one GPU node in the test cluster -- and yes, it has a license to kill pods on that node.

[root@node007 ~]# cat /etc/containerd/config.toml
root = "/var/lib/containerd"
state = "/run/containerd"
version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    disable_apparmor = true
    disable_cgroup = false
    enable_selinux = false
    sandbox_image = "k8s.gcr.io/kubernetes/pause:3.5"

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

[root@node007 ~]# lsof /run/containerd/containerd.sock
COMMAND     PID USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
container 11342 root   73u  unix 0xffff90cf1e7c5d80      0t0 6319896 /run/containerd/containerd.sock
container 11342 root  126u  unix 0xffff90114a7f9100      0t0 6258658 /run/containerd/containerd.sock
container 11342 root  127u  unix 0xffff90114a7f9980      0t0 6316625 /run/containerd/containerd.sock
container 11342 root  128u  unix 0xffff901152b85500      0t0 6316627 /run/containerd/containerd.sock

[root@node007 ~]# ps -e|grep contain
11342 ?        00:00:28 containerd
11602 ?        00:00:00 containerd-shim
11660 ?        00:00:00 containerd-shim
11923 ?        00:00:00 containerd-shim
12090 ?        00:00:01 containerd-shim
12110 ?        00:00:00 containerd-shim
12164 ?        00:00:00 containerd-shim
12827 ?        00:00:00 containerd-shim
14033 ?        00:00:00 containerd-shim
14279 ?        00:00:00 containerd-shim
14395 ?        00:00:00 containerd-shim
14783 ?        00:00:00 containerd-shim
15047 ?        00:00:00 containerd-shim
15963 ?        00:00:00 containerd-shim
17601 ?        00:00:00 containerd-shim
18811 ?        00:00:00 containerd-shim

[root@node007 ~]# crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                             ATTEMPT             POD ID
0d28a2bf51e20       be04d15b835f9       13 minutes ago      Running             nvidia-operator-validator        0                   827f818004285
71e743472d8ff       be04d15b835f9       13 minutes ago      Exited              nvidia-device-plugin-validator   0                   0dc75b8f4f6d5
83e67fc09858d       be04d15b835f9       13 minutes ago      Exited              plugin-validation                0                   0dc75b8f4f6d5
377152472cbb6       7a954b2e4193f       14 minutes ago      Running             nvidia-dcgm-exporter             0                   c66fd38059a5c
798a7d9708412       76b7b2bd88dd6       14 minutes ago      Running             nvidia-dcgm-ctr                  0                   24d64dc37cce8
2622b83d0ccfa       2832d3d94eb9f       14 minutes ago      Running             gpu-feature-discovery            0                   bbd3030143a79
ecd7ba68312ec       be04d15b835f9       14 minutes ago      Exited              plugin-validation                0                   827f818004285
00f95519a883e       514330f64584b       14 minutes ago      Running             nvidia-device-plugin-ctr         0                   8cd51061d5265
278bde4507961       be04d15b835f9       14 minutes ago      Exited              nvidia-cuda-validator            0                   3bbeae4bfd98c
35d5712f70711       be04d15b835f9       14 minutes ago      Exited              toolkit-validation               0                   24d64dc37cce8
e0b13770756a6       be04d15b835f9       14 minutes ago      Exited              cuda-validation                  0                   3bbeae4bfd98c
b2941b5801c01       be04d15b835f9       14 minutes ago      Exited              cuda-validation                  0                   827f818004285
50e4a213f1e01       be04d15b835f9       14 minutes ago      Exited              toolkit-validation               0                   827f818004285
7a19d5d5fd2c1       be04d15b835f9       14 minutes ago      Exited              driver-validation                0                   827f818004285
c9c0d31a06438       be04d15b835f9       14 minutes ago      Exited              toolkit-validation               0                   c66fd38059a5c
6762a3a9186fc       be04d15b835f9       14 minutes ago      Exited              toolkit-validation               0                   bbd3030143a79
aaa190b6c297c       be04d15b835f9       14 minutes ago      Exited              toolkit-validation               0                   8cd51061d5265
caf4daf88f9df       8ca879a398918       14 minutes ago      Running             nvidia-container-toolkit-ctr     0                   a57cc3a9e81b1
537feea12c394       02a92b6609aa0       15 minutes ago      Running             nvidia-driver-ctr                0                   5b55f170dd281
e3d0d6005c192       be04d15b835f9       15 minutes ago      Exited              driver-validation                0                   a57cc3a9e81b1
4520e6414468d       a33e62c447208       16 minutes ago      Exited              k8s-driver-manager               0                   5b55f170dd281
0fe415214e207       a9f76bcccfb5f       16 minutes ago      Running             controller                       0                   beddf86fde73f
e4010c273aec5       c41e9fcadf5a2       16 minutes ago      Exited              patch                            0                   52983ef09e110
86ce87c9e4da1       e118805916b97       16 minutes ago      Running             dashboard                        0                   1626bb92eb63a
3c5ceca84b8d1       48d79e554db69       16 minutes ago      Running             dashboard-metrics-scraper        0                   41c842833d20b
b19bf6cc69fc2       227ae20e1b044       16 minutes ago      Running             prometheus                       0                   dd085c90e00ca
8ae7cde4ed070       0fafea1498594       16 minutes ago      Running             node-exporter                    0                   4c5d1e24942b6
328b02e54de8f       cc8c6c9cc9f49       16 minutes ago      Running             worker                           0                   a2a938c7d3c74
f0f4be3e82190       5ef66b403f4f0       17 minutes ago      Running             calico-node                      0                   9b0699d773a58
997c11bab5c5d       5991877ebc118       17 minutes ago      Exited              flexvol-driver                   0                   9b0699d773a58
b2fddc4d633eb       4945b742b8e66       17 minutes ago      Exited              install-cni                      0                   9b0699d773a58
bc197cdc60765       4945b742b8e66       17 minutes ago      Exited              upgrade-ipam                     0                   9b0699d773a58
6bdede97e0478       8f8fdd6672d48       17 minutes ago      Running             kube-proxy                       0                   cd3aac9686931

But after issuing the helm uninstall command, during the Terminating state, I observe:

[root@node007 ~]# cat /etc/containerd/config.toml
root = "/var/lib/containerd"
state = "/run/containerd"
version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    disable_apparmor = true
    disable_cgroup = false
    enable_selinux = false
    sandbox_image = "k8s.gcr.io/kubernetes/pause:3.5"

    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
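Comparing the two config.toml dumps: the uninstall removed the nvidia and nvidia-experimental runtime entries along with default_runtime_name = "nvidia", i.e. the toolkit reverted the containerd config and restarted containerd, which is when the socket becomes unreachable. A simple way to capture this on the node during a run (a diagnostic sketch, not part of the original report):

```shell
# Snapshot containerd's config before uninstalling, then diff afterwards
# to see exactly what the toolkit's cleanup rewrote.
cp /etc/containerd/config.toml /tmp/config.toml.pre
# ... helm uninstall gpu-test -n gpu-test ...
diff -u /tmp/config.toml.pre /etc/containerd/config.toml
```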

[root@node007 ~]# lsof /run/containerd/containerd.sock

[root@node007 ~]# ps -e|grep contain
11602 ?        00:00:02 containerd-shim
11660 ?        00:00:02 containerd-shim
12090 ?        00:00:05 containerd-shim
12110 ?        00:00:03 containerd-shim
12164 ?        00:00:02 containerd-shim
12827 ?        00:00:02 containerd-shim
17601 ?        00:00:02 containerd-shim
18811 ?        00:00:02 containerd-shim
30487 ?        00:00:02 containerd

[root@node007 ~]# crictl ps -a
FATA[0002] connect: connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded 

[root@node007 ~]# cat /etc/crictl.yaml 
runtime-endpoint: unix:///run/containerd/containerd.sock
timeout: 2
debug: false
pull-image-on-create: false
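The FATA above reads "context deadline exceeded" because crictl.yaml sets a 2-second timeout. Raising the timeout for one call distinguishes a slow socket from a dead one (exact flag syntax may vary by crictl version; the kubelet "connection refused" errors below suggest the socket really is down, not just slow):

```shell
# Bypass the 2s timeout in /etc/crictl.yaml for a single query; if containerd
# is actually down this still fails, but with "connection refused" instead of
# a deadline error.
crictl --runtime-endpoint unix:///run/containerd/containerd.sock --timeout 10s ps -a
```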

Also during the Terminating state, kubelet is logging a seemingly endless stream of these types of messages:

Dec 15 17:43:01 node007 kubelet[2101]: E1215 17:43:01.366810    2101 kuberuntime_sandbox.go:281] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:01 node007 kubelet[2101]: E1215 17:43:01.366855    2101 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:01 node007 kubelet[2101]: E1215 17:43:01.372996    2101 remote_runtime.go:314] "ListContainers with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Dec 15 17:43:01 node007 kubelet[2101]: E1215 17:43:01.373080    2101 log_metrics.go:66] "Failed to get pod stats" err="failed to list all containers: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:01 node007 kubelet[2101]: E1215 17:43:01.373221    2101 remote_runtime.go:314] "ListContainers with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Dec 15 17:43:01 node007 kubelet[2101]: E1215 17:43:01.797044    2101 remote_image.go:152] "ImageFsInfo from image service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:01 node007 kubelet[2101]: E1215 17:43:01.797111    2101 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get imageFs stats: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:02 node007 kubelet[2101]: E1215 17:43:02.367405    2101 remote_runtime.go:207] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" filter="nil"
Dec 15 17:43:02 node007 kubelet[2101]: E1215 17:43:02.367473    2101 kuberuntime_sandbox.go:281] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:02 node007 kubelet[2101]: E1215 17:43:02.367519    2101 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:02 node007 kubelet[2101]: E1215 17:43:02.825860    2101 remote_runtime.go:207] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" filter="&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}"
Dec 15 17:43:02 node007 kubelet[2101]: E1215 17:43:02.825889    2101 kuberuntime_sandbox.go:281] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:02 node007 kubelet[2101]: E1215 17:43:02.825910    2101 kubelet_pods.go:1079] "Error listing containers" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:02 node007 kubelet[2101]: E1215 17:43:02.825929    2101 kubelet.go:2143] "Failed cleaning pods" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:03 node007 kubelet[2101]: E1215 17:43:03.368202    2101 remote_runtime.go:207] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" filter="nil"
Dec 15 17:43:03 node007 kubelet[2101]: E1215 17:43:03.368258    2101 kuberuntime_sandbox.go:281] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Dec 15 17:43:03 node007 kubelet[2101]: E1215 17:43:03.368301    2101 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""

Meanwhile, containerd logs countless iterations of these lines:

[root@node007 ~]# journalctl --no-pager -n 100 -xeu containerd
Dec 28 15:14:22 node007 systemd[1]: Starting containerd container runtime...
-- Subject: Unit containerd.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit containerd.service has begun starting up.
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.239950959-06:00" level=info msg="starting containerd" revision=7b11cfaabd73bb80907dd23182b9347b4245eb5d version=1.4.12
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.259956802-06:00" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.260021449-06:00" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.261628976-06:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.\\n\"): skip plugin" type=io.containerd.snapshotter.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.261686779-06:00" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.261746901-06:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.261773533-06:00" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.261818018-06:00" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.261918024-06:00" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.263759158-06:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.263788673-06:00" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.263810651-06:00" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.263818765-06:00" level=info msg="metadata content store policy set" policy=shared
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.263935532-06:00" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.263949689-06:00" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.263991667-06:00" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264031146-06:00" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264043760-06:00" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264055231-06:00" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264071913-06:00" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264084838-06:00" level=info msg="loading plugin \"io.containerd.service.v1.leases-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264098030-06:00" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264109792-06:00" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264120383-06:00" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
Dec 28 15:14:22 node007 containerd[909]: time="2021-12-28T15:14:22.264154685-06:00" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
Dec 28 15:15:52 node007 systemd[1]: containerd.service start operation timed out. Terminating.
Dec 28 15:16:02 node007 containerd[909]: time="2021-12-28T15:16:02.267378230-06:00" level=warning msg="cleaning up after shim disconnected" id=4c5d1e24942b6b990ef8aa0e645d7b7e9f999fb03f7e34b3f227c9649503b58e namespace=k8s.io
Dec 28 15:16:02 node007 containerd[909]: time="2021-12-28T15:16:02.267468307-06:00" level=info msg="cleaning up dead shim"
Dec 28 15:16:02 node007 containerd[909]: time="2021-12-28T15:16:02.284279109-06:00" level=warning msg="cleanup warnings time=\"2021-12-28T15:16:02-06:00\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=1410\n"
Dec 28 15:16:02 node007 containerd[909]: time="2021-12-28T15:16:02.287295750-06:00" level=error msg="loading container 86ce87c9e4da198823b0163f405984e15fa8dd3cf2b67b7c88c9e776fe5bc93b" error="container \"86ce87c9e4da198823b0163f405984e15fa8dd3cf2b67b7c88c9e776fe5bc93b\" in namespace \"k8s.io\": not found"
Dec 28 15:16:02 node007 containerd[909]: time="2021-12-28T15:16:02.287550704-06:00" level=error msg="loading container 8ae7cde4ed07061c07129de8e9ba6dbc14ebdfb17b7fe65c539948ca4a52f824" error="container \"8ae7cde4ed07061c07129de8e9ba6dbc14ebdfb17b7fe65c539948ca4a52f824\" in namespace \"k8s.io\": not found"
Dec 28 15:17:22 node007 systemd[1]: containerd.service stop-final-sigterm timed out. Killing.
Dec 28 15:17:22 node007 systemd[1]: containerd.service: main process exited, code=killed, status=9/KILL
Dec 28 15:17:22 node007 systemd[1]: Failed to start containerd container runtime.
-- Subject: Unit containerd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit containerd.service has failed.
-- 
-- The result is failed.
Dec 28 15:17:22 node007 systemd[1]: Unit containerd.service entered failed state.
Dec 28 15:17:22 node007 systemd[1]: containerd.service failed.
Dec 28 15:17:27 node007 systemd[1]: containerd.service holdoff time over, scheduling restart.
Dec 28 15:17:27 node007 systemd[1]: Stopped containerd container runtime.
-- Subject: Unit containerd.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit containerd.service has finished shutting down.

Note that, during every iteration of the log messages, containerd always times out after loading the io.containerd.runtime.v2.task plugin.
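The gap between "starting containerd" (15:14:22) and "start operation timed out" (15:15:52) in the journal above is exactly 90 seconds, which matches systemd's default TimeoutStartSec, so the pattern looks like: containerd hangs while the io.containerd.runtime.v2.task plugin reconnects to the leftover shims, systemd kills it, and the cycle repeats. A quick sanity check on the interval:

```python
# Confirm the containerd start-to-timeout window seen in the journal
# (15:14:22 -> 15:15:52) matches systemd's default TimeoutStartSec of 90s.
from datetime import datetime

started = datetime.strptime("15:14:22", "%H:%M:%S")
timed_out = datetime.strptime("15:15:52", "%H:%M:%S")
print((timed_out - started).total_seconds())  # -> 90.0
```

If that's the mechanism, a systemd drop-in raising TimeoutStartSec for containerd.service would be one way to test the theory, though that only papers over whatever makes the shims slow to reconnect.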

Throughout all of this, the operator's Pods remain stuck in Terminating:

NAME                                       READY   STATUS        RESTARTS        AGE
gpu-feature-discovery-zcfb4                1/1     Terminating   0               7h25m
nvidia-container-toolkit-daemonset-59krk   1/1     Terminating   0               7h25m
nvidia-dcgm-exporter-hcv96                 1/1     Terminating   0               7h25m
nvidia-dcgm-gsrbn                          1/1     Terminating   0               7h25m
nvidia-device-plugin-daemonset-w5nm8       1/1     Terminating   0               7h25m
nvidia-driver-daemonset-pcsmh              1/1     Terminating   0               7h25m
nvidia-operator-validator-gbdf8            1/1     Terminating   0               7h25m
Events:
  Type     Reason         Age    From     Message
  ----     ------         ----   ----     -------
  Normal   Killing        2m21s  kubelet  Stopping container nvidia-container-toolkit-ctr
  Warning  FailedKillPod  2m21s  kubelet  error killing pod: [failed to "KillContainer" for "nvidia-container-toolkit-ctr" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "4334b292-45db-4fc6-a37e-348330dd05f4" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""]
unable to retrieve container logs for containerd://2b705a3195f2278cf2e0e9b379721b1c1ab5e10c1f535ff8358f0cf94520a04e
[root@node007 ~]# ls -la /run/nvidia
total 4
drwxr-xr-x  4 root root  100 Dec 30 19:36 .
drwxr-xr-x 36 root root 1140 Dec 30 12:08 ..
drwxr-xr-x  1 root root   77 Dec 30 12:11 driver
-rw-r--r--  1 root root    6 Dec 30 12:11 nvidia-driver.pid
drwxr-xr-x  2 root root   40 Dec 30 19:36 validations
[root@node007 ~]# ls -la /usr/local/nvidia/toolkit/
total 8476
drwxr-xr-x 3 root root    4096 Dec 30 12:12 .
drwxr-xr-x 3 root root    4096 Dec 30 12:12 ..
drwxr-xr-x 3 root root    4096 Dec 30 12:12 .config
lrwxrwxrwx 1 root root      28 Dec 30 12:12 libnvidia-container.so.1 -> libnvidia-container.so.1.6.0
-rwxr-xr-x 1 root root  183288 Dec 30 12:12 libnvidia-container.so.1.6.0
-rwxr-xr-x 1 root root     154 Dec 30 12:12 nvidia-container-cli
-rwxr-xr-x 1 root root   43024 Dec 30 12:12 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Dec 30 12:12 nvidia-container-runtime
-rwxr-xr-x 1 root root     414 Dec 30 12:12 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 4006376 Dec 30 12:12 nvidia-container-runtime.experimental
lrwxrwxrwx 1 root root      24 Dec 30 12:12 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x 1 root root 2260408 Dec 30 12:12 nvidia-container-runtime.real
-rwxr-xr-x 1 root root     198 Dec 30 12:12 nvidia-container-toolkit
-rwxr-xr-x 1 root root 2147896 Dec 30 12:12 nvidia-container-toolkit.real
[root@node007 ~]# ls -la /run/nvidia/driver/
total 28
drwxr-xr-x   1 root root    77 Dec 30 12:11 .
drwxr-xr-x   4 root root   100 Dec 30 19:36 ..
-rw-r--r--   1 root root 12114 Nov 12  2020 anaconda-post.log
lrwxrwxrwx   1 root root     7 Nov 12  2020 bin -> usr/bin
drwxr-xr-x  18 root root  3780 Dec 30 12:12 dev
drwxr-xr-x   1 root root    84 Dec 30 12:12 etc
drwxr-xr-x   2 root root     6 Apr 10  2018 home
drwxr-xr-x   2 root root    24 Dec 30 12:11 host-etc
lrwxrwxrwx   1 root root     7 Nov 12  2020 lib -> usr/lib
lrwxrwxrwx   1 root root     9 Nov 12  2020 lib64 -> usr/lib64
drwxr-xr-x   2 root root     6 Apr 10  2018 media
drwxr-xr-x   2 root root     6 Apr 10  2018 mnt
-rw-r--r--   1 root root 16047 Aug  3 15:33 NGC-DL-CONTAINER-LICENSE
drwxr-xr-x   2 root root     6 Apr 10  2018 opt
dr-xr-xr-x 443 root root     0 Dec 17 12:49 proc
dr-xr-x---   1 root root    18 Aug  3 15:33 root
drwxr-xr-x   1 root root    88 Dec 30 19:36 run
lrwxrwxrwx   1 root root     8 Nov 12  2020 sbin -> usr/sbin
drwxr-xr-x   2 root root     6 Apr 10  2018 srv
dr-xr-xr-x  13 root root     0 Dec 30 12:10 sys
drwxrwxrwt   1 root root     6 Dec 30 12:12 tmp
drwxr-xr-x   1 root root    65 Nov 12  2020 usr
drwxr-xr-x   1 root root    41 Nov 12  2020 var
shivamerla commented 2 years ago

@dbugit I see that you are using the ubi8 toolkit image; please change it to 1.7.2-centos7 and give it a try. Also, if you can try with containerd 1.5+, that would be good.

dbugit commented 2 years ago

Changing to the 1.7.2-centos7 image didn't improve the timing issues at all, but it did seem to clean driver components off the node more thoroughly. The ubi8 image would often leave behind containers and processes that prevented other cleanup from happening, like flushing iptables rules and unmounting volumes.

But this leads me to a bigger question. As far as I can tell, the toolkit is the only image built specifically for CentOS 7; other component images support only UBI 8 or Ubuntu, and still others are platform-agnostic. Why the mix? Shouldn't all the images be built for and tested against the same platform?