NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Back-off restarting failed container nvidia-device-plugin-ctr #652

Open A-Akhil opened 1 month ago

A-Akhil commented 1 month ago

While installing Kubernetes (K8s) on a DGX A100 server, the Helm install of nvidia-device-plugin fails with the following error:

```
kubectl get pods -A
NAMESPACE      NAME                                                              READY   STATUS             RESTARTS        AGE
kube-flannel   kube-flannel-ds-2ss2c                                             1/1     Running            1 (3d21h ago)   3d22h
kube-flannel   kube-flannel-ds-9cwh9                                             1/1     Running            0               3d22h
kube-system    coredns-787d4945fb-9rcpx                                          1/1     Running            0               3d22h
kube-system    coredns-787d4945fb-9scjh                                          1/1     Running            0               3d22h
kube-system    etcd-sybsramma-virtual-machine                                    1/1     Running            0               3d22h
kube-system    gpu-feature-discovery-1712918793-gpu-feature-discovery-dr6ht      1/1     Running            0               3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-marrffd   1/1     Running            0               3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-womw95r   1/1     Running            1 (3d21h ago)   3d21h
root@sybsramma-virtual-machine:~#                                                                                            3d22h
kube-system    kube-controller-manager-sybsramma-virtual-machine                 1/1     Running            0               3d22h
kube-system    kube-proxy-hnb42                                                  1/1     Running            0               3d22h
kube-system    kube-proxy-s7q7h                                                  1/1     Running            1 (3d21h ago)   3d22h
kube-system    kube-scheduler-sybsramma-virtual-machine                          1/1     Running            0               3d22h
kube-system    nvidia-device-plugin-1712918682-bs4vf                             0/1     CrashLoopBackOff   1104 (23s ago)  3d21h
```

```
kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system

Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:
IPs:
  IP:
Controlled By:  DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://9ad6475f973adb6fb463acff145cb7609e0a2e728d12a0c4ae9cf77ed2201cde
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 16 Apr 2024 13:40:32 +0530
      Finished:     Tue, 16 Apr 2024 13:40:32 +0530
    Ready:          False
    Restart Count:  1100
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  Warning  BackOff  3m20s (x25802 over 3d21h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
```

1. Quick Debug Information

```
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T01:05:39Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T00:54:27Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
```

elezar commented 1 month ago

@A-Akhil could you please provide information on the helm values that you are providing to the plugin? Do any of the plugin containers show any logs?

A-Akhil commented 1 month ago

@elezar This is the command we used to install the nvidia-device-plugin with Helm:

```
helm install \
    --version=0.15.0-rc.2 \
    --generate-name \
    --namespace kube-system \
    --create-namespace \
    --set migStrategy=single \
    nvdp/nvidia-device-plugin
```
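(For the question about Helm values above, the values the release actually received can also be dumped with `helm get values`; a small sketch, assuming the generated release name matches the pod prefix shown earlier:)

```sh
# User-supplied values for the generated release
helm get values nvidia-device-plugin-1712918682 -n kube-system
# All computed values, including chart defaults
helm get values nvidia-device-plugin-1712918682 -n kube-system --all
```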

This is the log:

```
root@sybsramma-virtual-machine:~# kubectl logs nvidia-device-plugin-1712918682-bs4vf -n kube-system
I0417 03:40:28.204942       1 main.go:178] Starting FS watcher.
I0417 03:40:28.205038       1 main.go:185] Starting OS watcher.
I0417 03:40:28.205179       1 main.go:200] Starting Plugins.
I0417 03:40:28.205216       1 main.go:257] Loading configuration.
I0417 03:40:28.205615       1 main.go:265] Updating config with default resource matching patterns.
I0417 03:40:28.205917       1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0417 03:40:28.205925       1 main.go:279] Retrieving plugins.
W0417 03:40:28.205971       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0417 03:40:28.205998       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0417 03:40:28.206027       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0417 03:40:28.206033       1 factory.go:112] Incompatible platform detected
E0417 03:40:28.206037       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0417 03:40:28.206041       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0417 03:40:28.206046       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0417 03:40:28.206051       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0417 03:40:28.213668       1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
root@sybsramma-virtual-machine:~#
```

and this is the describe:

```
root@sybsramma-virtual-machine:~# kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/172.16.0.32
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:                   10.244.1.47
IPs:
  IP:  10.244.1.47
Controlled By:  DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://62c676b690d74f0591ba712abb5d0fb567c7ab0c9d65e56a188b5b51f9c65ade
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Apr 2024 09:15:30 +0530
      Finished:     Wed, 17 Apr 2024 09:15:30 +0530
    Ready:          False
    Restart Count:  1330
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  Warning  BackOff  4m13s (x31185 over 4d16h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
root@sybsramma-virtual-machine:~#
```

Kubernetes_Documentation.pdf

This is the documentation I used to install the plugin. In it, we chose MIG enabled with the same instance type for the Helm install.
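(As a quick host-side sanity check for the `libnvidia-ml.so.1: cannot open shared object file` error in the log above, something along these lines can confirm whether the driver and MIG layout are visible on the DGX host and whether containerd knows about the nvidia runtime; a sketch, assuming the default /etc/containerd/config.toml location:)

```sh
# On the DGX host: confirm the driver is loaded and the MIG instances exist
nvidia-smi -L
nvidia-smi mig -lgi        # lists GPU instances when MIG mode is enabled

# Check whether containerd has an nvidia runtime entry and what the default runtime is
grep -n 'nvidia' /etc/containerd/config.toml
grep -n 'default_runtime_name' /etc/containerd/config.toml
```

If `nvidia-smi` works on the host but the plugin container still cannot load NVML, the container is most likely not being started through the nvidia runtime.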

elezar commented 1 week ago

@A-Akhil have you installed the NVIDIA Container Toolkit on the host and configured containerd to use the nvidia runtime?
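For reference, wiring the toolkit into containerd usually amounts to something like the following; a sketch only, assuming an apt-based host with the NVIDIA Container Toolkit repository already configured:

```sh
# Install the NVIDIA Container Toolkit (assumes its apt repository is already set up)
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Add the nvidia runtime to /etc/containerd/config.toml ...
sudo nvidia-ctk runtime configure --runtime=containerd

# ... or additionally make it the default runtime instead of runc
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

# Restart containerd so the new runtime configuration takes effect
sudo systemctl restart containerd
```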

Furthermore, if the nvidia runtime is not the default runtime for containerd, you would also need to set up and request a RuntimeClass.
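A minimal sketch of that RuntimeClass route (the handler must match the nvidia runtime name registered in containerd; the chart's runtimeClassName value is assumed here so the plugin pods actually request it):

```sh
# Create a RuntimeClass whose handler matches the nvidia runtime configured in containerd
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# Re-deploy the plugin so its pods request that RuntimeClass
# (release name and flags taken from the helm install earlier in this thread)
helm upgrade nvidia-device-plugin-1712918682 nvdp/nvidia-device-plugin \
    --namespace kube-system \
    --version=0.15.0-rc.2 \
    --set migStrategy=single \
    --set runtimeClassName=nvidia
```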

See https://github.com/NVIDIA/k8s-device-plugin/issues/604#issuecomment-2097764513 and https://github.com/NVIDIA/k8s-device-plugin/issues/604#issuecomment-2097855058 for more information.