A-Akhil opened this issue 6 months ago
@A-Akhil What does the log message from the device-plugin pod say? Also, please report this here, as it is a standalone device-plugin installation.
@shivamerla
This is the command we used to install the nvidia-device-plugin via helm:
```shell
helm install \
  --version=0.15.0-rc.2 \
  --generate-name \
  --namespace kube-system \
  --create-namespace \
  --set migStrategy=single \
  nvdp/nvidia-device-plugin
```
This is the log:

```
root@sybsramma-virtual-machine:~# kubectl logs nvidia-device-plugin-1712918682-bs4vf -n kube-system
I0417 03:40:28.204942       1 main.go:178] Starting FS watcher.
I0417 03:40:28.205038       1 main.go:185] Starting OS watcher.
I0417 03:40:28.205179       1 main.go:200] Starting Plugins.
I0417 03:40:28.205216       1 main.go:257] Loading configuration.
I0417 03:40:28.205615       1 main.go:265] Updating config with default resource matching patterns.
I0417 03:40:28.205917       1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0417 03:40:28.205925       1 main.go:279] Retrieving plugins.
W0417 03:40:28.205971       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0417 03:40:28.205998       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0417 03:40:28.206027       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0417 03:40:28.206033       1 factory.go:112] Incompatible platform detected
E0417 03:40:28.206037       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0417 03:40:28.206041       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0417 03:40:28.206046       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0417 03:40:28.206051       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0417 03:40:28.213668       1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
```
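The `Detected non-NVML platform: could not load NVML library` error means the container never saw the driver's `libnvidia-ml.so.1`, which usually points at the container runtime not being configured to use the NVIDIA runtime. As a rough sketch (assuming containerd; the file path under `/tmp` is only a stand-in so the grep is runnable — on a real node you would inspect `/etc/containerd/config.toml` itself), a node set up with `nvidia-ctk runtime configure --runtime=containerd --set-as-default` would contain entries like these:

```shell
# Sample of what a correctly configured containerd config would contain after
# `nvidia-ctk runtime configure --runtime=containerd --set-as-default`.
# Written to /tmp only for illustration; check /etc/containerd/config.toml on the node.
cat > /tmp/sample-containerd-config.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
EOF

# On the real node, this grep should show the nvidia runtime as the default:
grep default_runtime_name /tmp/sample-containerd-config.toml
```

If `default_runtime_name` is not `"nvidia"` (and the device plugin is not deployed with CDI or a per-pod RuntimeClass), the plugin pod runs under plain runc and cannot load NVML, which matches the log above. Remember to restart containerd after changing the config.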
and this is the `kubectl describe` output:

```
root@sybsramma-virtual-machine:~# kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/172.16.0.32
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
...
Events:
  Type     Reason   Age                        From     Message
  ----     ------   ----                       ----     -------
  Warning  BackOff  4m13s (x31185 over 4d16h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
```
This is the documentation I used to install the plugin. Following it, we chose MIG enabled with the same instance type on every GPU (`migStrategy=single`) for the helm install.
Are the NVIDIA drivers and the container toolkit set up correctly on the node?
@shivamerla Yes, it's set up properly.
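One way to verify that claim independently of the device plugin is to run a plain CUDA pod on the node and see whether `nvidia-smi` works under the configured runtime. This is only an illustrative manifest (the pod name and image tag are examples, not from this thread); if it fails the same way the plugin does, the runtime is not injecting the driver libraries:

```yaml
# Illustrative test pod: with the NVIDIA Container Toolkit set as the default
# runtime, nvidia-smi should print the GPU table; a libnvidia-ml.so.1 error
# here reproduces the device plugin's failure outside the plugin itself.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test   # example name
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu   # in case the GPU node is tainted
    operator: Exists
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04   # example tag
    command: ["nvidia-smi"]
```

Apply it with `kubectl apply -f`, then check `kubectl logs nvidia-smi-test`.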
While installing Kubernetes (K8s) on a DGX A100 server, the helm install of nvidia-device-plugin fails with the following error:
```
root@sybsramma-virtual-machine:~# kubectl get pods -A
NAMESPACE      NAME                                                              READY   STATUS             RESTARTS        AGE
kube-flannel   kube-flannel-ds-2ss2c                                             1/1     Running            1 (3d21h ago)   3d22h
kube-flannel   kube-flannel-ds-9cwh9                                             1/1     Running            0               3d22h
kube-system    coredns-787d4945fb-9rcpx                                          1/1     Running            0               3d22h
kube-system    coredns-787d4945fb-9scjh                                          1/1     Running            0               3d22h
kube-system    etcd-sybsramma-virtual-machine                                    1/1     Running            0               3d22h
kube-system    gpu-feature-discovery-1712918793-gpu-feature-discovery-dr6ht      1/1     Running            0               3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-marrffd   1/1     Running            0               3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-womw95r   1/1     Running            1 (3d21h ago)   3d21h
kube-system    kube-controller-manager-sybsramma-virtual-machine                 1/1     Running            0               3d22h
kube-system    kube-proxy-hnb42                                                  1/1     Running            0               3d22h
kube-system    kube-proxy-s7q7h                                                  1/1     Running            1 (3d21h ago)   3d22h
kube-system    kube-scheduler-sybsramma-virtual-machine                          1/1     Running            0               3d22h
kube-system    nvidia-device-plugin-1712918682-bs4vf                             0/1     CrashLoopBackOff   1104 (23s ago)  3d21h

root@sybsramma-virtual-machine:~# kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:
IPs:
  IP:
Controlled By:  DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://9ad6475f973adb6fb463acff145cb7609e0a2e728d12a0c4ae9cf77ed2201cde
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 16 Apr 2024 13:40:32 +0530
      Finished:     Tue, 16 Apr 2024 13:40:32 +0530
    Ready:          False
    Restart Count:  1100
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  ----     ------   ----                       ----     -------
  Warning  BackOff  3m20s (x25802 over 3d21h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
```
1. Quick Debug Information
```
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T01:05:39Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T00:54:27Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
```