NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Back-off restarting failed container nvidia-device-plugin-ctr #699

Open A-Akhil opened 6 months ago

A-Akhil commented 6 months ago

While installing Kubernetes (K8s) on a DGX A100 server, the nvidia-device-plugin Helm install fails with the following error:

```
kubectl get pods -A
NAMESPACE      NAME                                                              READY   STATUS             RESTARTS         AGE
kube-flannel   kube-flannel-ds-2ss2c                                             1/1     Running            1 (3d21h ago)    3d22h
kube-flannel   kube-flannel-ds-9cwh9                                             1/1     Running            0                3d22h
kube-system    coredns-787d4945fb-9rcpx                                          1/1     Running            0                3d22h
kube-system    coredns-787d4945fb-9scjh                                          1/1     Running            0                3d22h
kube-system    etcd-sybsramma-virtual-machine                                    1/1     Running            0                3d22h
kube-system    gpu-feature-discovery-1712918793-gpu-feature-discovery-dr6ht     1/1     Running            0                3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-marrffd  1/1     Running            0                3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-womw95r  1/1     Running            1 (3d21h ago)    3d21h
kube-system    kube-controller-manager-sybsramma-virtual-machine                 1/1     Running            0                3d22h
kube-system    kube-proxy-hnb42                                                  1/1     Running            0                3d22h
kube-system    kube-proxy-s7q7h                                                  1/1     Running            1 (3d21h ago)    3d22h
kube-system    kube-scheduler-sybsramma-virtual-machine                          1/1     Running            0                3d22h
kube-system    nvidia-device-plugin-1712918682-bs4vf                             0/1     CrashLoopBackOff   1104 (23s ago)   3d21h
```

```
kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system

Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:
IPs:
  IP:
Controlled By:        DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://9ad6475f973adb6fb463acff145cb7609e0a2e728d12a0c4ae9cf77ed2201cde
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 16 Apr 2024 13:40:32 +0530
      Finished:     Tue, 16 Apr 2024 13:40:32 +0530
    Ready:          False
    Restart Count:  1100
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  Warning  BackOff  3m20s (x25802 over 3d21h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
```
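For reference, the BackOff event above only says the container keeps failing; the actual failure reason has to come from the container log. A minimal way to pull it (pod name and namespace taken from the output above):

```bash
# Logs from the currently crash-looping container
kubectl logs -n kube-system nvidia-device-plugin-1712918682-bs4vf

# Logs from the previous attempt, in case the current one exits before logging
kubectl logs -n kube-system nvidia-device-plugin-1712918682-bs4vf --previous
```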

1. Quick Debug Information

```
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T01:05:39Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T00:54:27Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
```

shivamerla commented 5 months ago

@A-Akhil What do the logs from the device-plugin pod say? Also, please report this against the standalone k8s-device-plugin project, since this is a standalone device-plugin installation rather than a GPU Operator one.

A-Akhil commented 5 months ago

@shivamerla
This is the command we used to install nvidia-device-plugin with Helm:

```
helm install \
  --version=0.15.0-rc.2 \
  --generate-name \
  --namespace kube-system \
  --create-namespace \
  --set migStrategy=single \
  nvdp/nvidia-device-plugin
```
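As a sanity check (not part of the original report), the values the release actually deployed with can be confirmed; the release name below is the generated one visible in the pod labels above:

```bash
# List Helm releases in kube-system to find the generated release name
helm list -n kube-system

# Show both supplied and computed values for that release
helm get values nvidia-device-plugin-1712918682 -n kube-system --all
```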

This is the log:

```
root@sybsramma-virtual-machine:~# kubectl logs nvidia-device-plugin-1712918682-bs4vf -n kube-system
I0417 03:40:28.204942 1 main.go:178] Starting FS watcher.
I0417 03:40:28.205038 1 main.go:185] Starting OS watcher.
I0417 03:40:28.205179 1 main.go:200] Starting Plugins.
I0417 03:40:28.205216 1 main.go:257] Loading configuration.
I0417 03:40:28.205615 1 main.go:265] Updating config with default resource matching patterns.
I0417 03:40:28.205917 1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0417 03:40:28.205925 1 main.go:279] Retrieving plugins.
W0417 03:40:28.205971 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0417 03:40:28.205998 1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0417 03:40:28.206027 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0417 03:40:28.206033 1 factory.go:112] Incompatible platform detected
E0417 03:40:28.206037 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0417 03:40:28.206041 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0417 03:40:28.206046 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0417 03:40:28.206051 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0417 03:40:28.213668 1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
root@sybsramma-virtual-machine:~#
```
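The "could not load NVML library: libnvidia-ml.so.1" and "did you configure the NVIDIA Container Toolkit?" lines in this log mean the plugin container cannot see the host driver libraries, which usually indicates containerd is not injecting the NVIDIA runtime. A hedged sketch of how that is typically checked and fixed on a containerd node (commands assume a standard NVIDIA Container Toolkit install on the host; adapt paths as needed):

```bash
# Host-level sanity check: driver and toolkit CLI are present
nvidia-smi
nvidia-ctk --version

# Point containerd at the nvidia runtime and make it the default, then restart
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

# Confirm the generated config actually references the nvidia runtime
grep -A3 'nvidia' /etc/containerd/config.toml
```

After containerd is restarted, the crash-looping plugin pod has to be recreated (e.g. deleted so the DaemonSet replaces it) for the change to take effect.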

And this is the describe output:

```
root@sybsramma-virtual-machine:~# kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/172.16.0.32
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:                   10.244.1.47
IPs:
  IP:                 10.244.1.47
Controlled By:        DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://62c676b690d74f0591ba712abb5d0fb567c7ab0c9d65e56a188b5b51f9c65ade
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Apr 2024 09:15:30 +0530
      Finished:     Wed, 17 Apr 2024 09:15:30 +0530
    Ready:          False
    Restart Count:  1330
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  Warning  BackOff  4m13s (x31185 over 4d16h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
root@sybsramma-virtual-machine:~#
```

Kubernetes_Documentation.pdf

This is the documentation I used to install the plugin. Following it, we chose MIG enabled with the same instance type on every GPU for the Helm install (hence migStrategy=single).
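For reference, since the chart was installed with migStrategy=single, the plugin expects MIG to be enabled with a uniform profile on every GPU. A small check on the DGX A100 host (standard nvidia-smi invocations, shown as a sketch):

```bash
# MIG mode per GPU: should report Enabled on all GPUs for the single strategy
nvidia-smi --query-gpu=index,mig.mode.current --format=csv

# List the GPUs and any MIG devices; with the "single" strategy every
# GPU instance should use the same profile
nvidia-smi -L
```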

shivamerla commented 5 months ago

Are the NVIDIA drivers and the NVIDIA Container Toolkit set up correctly on the node?
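One way to answer this concretely is to check, on the DGX node itself, that the library the plugin failed to load and the toolkit binaries are actually present; a minimal sketch:

```bash
# The plugin failed to load libnvidia-ml.so.1; confirm the driver library is registered on the host
ldconfig -p | grep libnvidia-ml

# Confirm the container-toolkit binaries are installed and on PATH
which nvidia-container-runtime nvidia-container-cli nvidia-ctk

# And confirm the driver itself is healthy
nvidia-smi
```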

A-Akhil commented 5 months ago

@shivamerla Yes, it's set up properly.