A-Akhil opened this issue 1 month ago
@A-Akhil could you please provide information on the helm values that you are providing to the plugin? Do any of the plugin containers show any logs?
@elezar This is the command which we used to install the nvidia-device-plugin using helm:

```shell
helm install \
  --version=0.15.0-rc.2 \
  --generate-name \
  --namespace kube-system \
  --create-namespace \
  --set migStrategy=single \
  nvdp/nvidia-device-plugin
```
This is the log:

```
root@sybsramma-virtual-machine:~# kubectl logs nvidia-device-plugin-1712918682-bs4vf -n kube-system
I0417 03:40:28.204942       1 main.go:178] Starting FS watcher.
I0417 03:40:28.205038       1 main.go:185] Starting OS watcher.
I0417 03:40:28.205179       1 main.go:200] Starting Plugins.
I0417 03:40:28.205216       1 main.go:257] Loading configuration.
I0417 03:40:28.205615       1 main.go:265] Updating config with default resource matching patterns.
I0417 03:40:28.205917       1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0417 03:40:28.205925       1 main.go:279] Retrieving plugins.
W0417 03:40:28.205971       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0417 03:40:28.205998       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0417 03:40:28.206027       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0417 03:40:28.206033       1 factory.go:112] Incompatible platform detected
E0417 03:40:28.206037       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0417 03:40:28.206041       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0417 03:40:28.206046       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0417 03:40:28.206051       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0417 03:40:28.213668       1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
root@sybsramma-virtual-machine:~#
```
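The key line in the log above is the NVML failure: the plugin container cannot load `libnvidia-ml.so.1`, which means the container runtime is not injecting the host's driver libraries. One way to check this directly, before involving Kubernetes at all, is to run a test container through containerd with the nvidia runtime (a diagnostic sketch; the image tag and runtime path are illustrative and may differ on your host):

```shell
# Confirm the driver works on the host itself (DGX OS ships the driver)
nvidia-smi

# Run a CUDA base image through containerd using the nvidia runtime binary.
# If nvidia-smi fails or is missing inside the container, the runtime is not
# injecting the driver libraries, and the device plugin will fail the same way.
sudo ctr image pull docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04
sudo ctr run --rm -t \
  --runc-binary=/usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 nvml-test nvidia-smi
```

If the host-level `nvidia-smi` works but the `ctr run` test does not, the problem is in the containerd/runtime configuration rather than the driver.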
and this is the describe output:

```
root@sybsramma-virtual-machine:~# kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/172.16.0.32
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
  Warning  BackOff  4m13s (x31185 over 4d16h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
root@sybsramma-virtual-machine:~#
```
This is the documentation that I used to install the plugin. Following it, we chose MIG enabled with the same instance type (`migStrategy=single`) for the Helm install.
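Since `migStrategy=single` expects MIG to be enabled with identically sized GPU instances on every GPU, it may also be worth confirming the MIG state on the host first (a sketch using standard `nvidia-smi` queries; output will depend on your partitioning):

```shell
# Confirm MIG mode is enabled on each GPU
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv

# List the MIG GPU instances; with the "single" strategy, every GPU must
# expose the same instance profile (e.g. all 1g.5gb, or all 3g.20gb)
sudo nvidia-smi mig -lgi
```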
@A-Akhil have you installed the NVIDIA Container Toolkit on the host and configured containerd with this runtime?
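For reference, on a containerd host the toolkit is typically wired up like this (a sketch following the NVIDIA Container Toolkit docs; the distro-specific package installation steps are omitted):

```shell
# Add the nvidia runtime to /etc/containerd/config.toml
sudo nvidia-ctk runtime configure --runtime=containerd

# Optionally make it the default runtime, so that no RuntimeClass is needed
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

# Restart containerd to pick up the new configuration
sudo systemctl restart containerd
```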
Furthermore, if the nvidia runtime is not the default runtime for containerd, you would also need to set up and request a RuntimeClass.
See https://github.com/NVIDIA/k8s-device-plugin/issues/604#issuecomment-2097764513 and https://github.com/NVIDIA/k8s-device-plugin/issues/604#issuecomment-2097855058 for more information.
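As a sketch of that RuntimeClass option (the name `nvidia` here is illustrative; the `handler` must match the runtime name configured in containerd's `config.toml`):

```shell
# Create a RuntimeClass that maps to containerd's nvidia runtime handler
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```

The plugin's pods then need to request it; the Helm chart exposes a `runtimeClassName` value for this, e.g. adding `--set runtimeClassName=nvidia` to the `helm install` command above.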
While installing Kubernetes (K8s) on a DGX A100 server, we got the following error at the time of the Helm install for nvidia-device-plugin:
```
root@sybsramma-virtual-machine:~# kubectl get pods -A
NAMESPACE      NAME                                                              READY   STATUS             RESTARTS        AGE
kube-flannel   kube-flannel-ds-2ss2c                                             1/1     Running            1 (3d21h ago)   3d22h
kube-flannel   kube-flannel-ds-9cwh9                                             1/1     Running            0               3d22h
kube-system    coredns-787d4945fb-9rcpx                                          1/1     Running            0               3d22h
kube-system    coredns-787d4945fb-9scjh                                          1/1     Running            0               3d22h
kube-system    etcd-sybsramma-virtual-machine                                    1/1     Running            0               3d22h
kube-system    gpu-feature-discovery-1712918793-gpu-feature-discovery-dr6ht      1/1     Running            0               3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-marrffd   1/1     Running            0               3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-womw95r   1/1     Running            1 (3d21h ago)   3d21h
kube-system    kube-controller-manager-sybsramma-virtual-machine                 1/1     Running            0               3d22h
kube-system    kube-proxy-hnb42                                                  1/1     Running            0               3d22h
kube-system    kube-proxy-s7q7h                                                  1/1     Running            1 (3d21h ago)   3d22h
kube-system    kube-scheduler-sybsramma-virtual-machine                          1/1     Running            0               3d22h
kube-system    nvidia-device-plugin-1712918682-bs4vf                             0/1     CrashLoopBackOff   1104 (23s ago)  3d21h
```

```
kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:
IPs:
  IP:
Controlled By:        DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://9ad6475f973adb6fb463acff145cb7609e0a2e728d12a0c4ae9cf77ed2201cde
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 16 Apr 2024 13:40:32 +0530
      Finished:     Tue, 16 Apr 2024 13:40:32 +0530
    Ready:          False
    Restart Count:  1100
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:        BestEffort
Node-Selectors:
Tolerations:      CriticalAddonsOnly op=Exists
                  node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                  node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                  node.kubernetes.io/not-ready:NoExecute op=Exists
                  node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                  node.kubernetes.io/unreachable:NoExecute op=Exists
                  node.kubernetes.io/unschedulable:NoSchedule op=Exists
                  nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  Warning  BackOff  3m20s (x25802 over 3d21h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
```
1. Quick Debug Information
```
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T01:05:39Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T00:54:27Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
```