NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0
2.7k stars 614 forks

nvidia-device-plugin daemonset has 0 desired and no pod is launched #315

Open blackjack2015 opened 2 years ago

blackjack2015 commented 2 years ago

Thanks for this brilliant tool for deploying GPU-enabled pods with Kubernetes. I have successfully installed all the prerequisites (including docker, nvidia-docker2, and kubernetes). Some system and software information:

GPU device: NVIDIA GeForce RTX 2070 SUPER
Driver version: 515.48.07
Docker version: 20.10.17
Kubernetes version: 1.24.2

The /etc/docker/daemon.json has been edited as follows:

[screenshot: contents of /etc/docker/daemon.json]
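For reference, a typical /etc/docker/daemon.json that registers the NVIDIA runtime and makes it the default looks like this (a sketch based on the standard nvidia-docker2 setup; the screenshot above may differ):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After editing the file, Docker must be restarted (e.g. `sudo systemctl restart docker`) for the change to take effect.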

I have also checked that nvidia docker runs successfully with "docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi".

I then deployed the "nvidia-device-plugin-daemonset" with: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml

Then I checked the DaemonSet status with "kubectl get daemonset -A" and got:

[screenshot: output of "kubectl get daemonset -A"]

The pod information is:

[screenshot: pod listing]

It seems that no pod of "nvidia-device-plugin" is launched.

Would you mind giving some suggestions to solve this? Thank you!
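For anyone debugging the same symptom: a DaemonSet with 0 desired pods usually means its node selector, node affinity, or tolerations match no nodes. A few standard kubectl commands that help narrow it down (run against your own cluster; namespace and object names follow the v0.12.2 manifest):

```shell
# Show the DaemonSet's node selector/affinity and events
kubectl -n kube-system describe daemonset nvidia-device-plugin-daemonset

# List node labels to see whether they match the selector
kubectl get nodes --show-labels

# Check for taints that would repel the plugin pod
kubectl describe nodes | grep -A1 Taints

# Recent events often name the scheduling failure directly
kubectl -n kube-system get events --sort-by=.lastTimestamp
```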

aditya2803 commented 1 year ago

Hi, were you able to solve this ? @blackjack2015 I am stuck in the same spot.

anibali commented 1 year ago

Double-check that pods can be scheduled to your node. I forgot to remove the node-role.kubernetes.io/control-plane taint and was having this problem.
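For reference, the taint can be inspected and removed like this (standard kubectl syntax; substitute your node name):

```shell
# Inspect taints on the node
kubectl describe node <node-name> | grep Taints

# Remove the control-plane taint (the trailing "-" deletes it)
kubectl taint nodes <node-name> node-role.kubernetes.io/control-plane-
```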

blackjack2015 commented 1 year ago

> Double-check that pods can be scheduled to your node. I forgot to remove the node-role.kubernetes.io/control-plane taint and was having this problem.

I have confirmed this. My current workaround is to use Kubernetes 1.22 instead of 1.24. The point is that versions above 1.22 use containerd as the default container runtime, while 1.22 uses dockerd.
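Rather than downgrading, the NVIDIA runtime can also be registered with containerd directly. A sketch of the relevant section of /etc/containerd/config.toml (following the NVIDIA Container Toolkit documentation; exact plugin paths can vary by containerd release):

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

Then restart containerd (e.g. `sudo systemctl restart containerd`) so the new runtime takes effect.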

varskann commented 1 year ago

Hey! Any update on this issue? @blackjack2015 @anibali

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

v1nsai commented 1 month ago

I ran into this issue using Talos Kubernetes and ended up manually adding the "nvidia.com/gpu.present=true" label to my node. This concerns me, because a number of other nvidia labels should have been added automatically by... something that appears to have failed to do so. But on the other hand, everything works now, so 🤷🏻

AngellusMortis commented 1 month ago

This seems to be an issue starting with device plugin v0.15.0. Adding the label @v1nsai mentioned makes the daemonset select the right nodes.
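For anyone hitting this on v0.15.0+: the label @v1nsai mentioned (normally applied automatically by GPU Feature Discovery / Node Feature Discovery) can be added by hand, substituting your node name:

```shell
# Label the GPU node so the DaemonSet's node selector matches it
kubectl label node <node-name> nvidia.com/gpu.present=true

# Verify the DaemonSet now schedules a pod
kubectl -n kube-system get daemonset
```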