Hi @summerisc, which version of the device plugin is this, and what does nvidia-smi
show on one of the affected hosts?
I am using the nvcr.io/nvidia/k8s-device-plugin:v0.9.0 image. nvidia-smi shows that the CUDA/driver versions and the GPUs are the same across the nodes in the cluster.
Issue or feature description
I have a four-node cluster with two GPUs per node. After installing the plugin, every node correctly advertises its 2 GPUs, which works fine. But after I reboot any one of the servers, for example 1-9, the plugin no longer reports the correct number of GPUs. Not only does 1-9 show 0 GPUs (even after the reboot, with 1-9 back online), but 1-10 is also affected and shows 0 GPUs, even though I did nothing on that node. Any ideas why this happens? I appreciate any insights.
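For reference, one way to see the per-node GPU count being described is to query each node's allocatable resources; the exact commands below are illustrative and are not taken from the original report:

# show the nvidia.com/gpu resources each node advertises (illustrative)
kubectl describe nodes | grep -i "nvidia.com/gpu"

# per-node summary of allocatable GPUs (illustrative)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'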
Steps to reproduce the issue
sudo journalctl -r -u kubelet
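As an illustrative diagnostic step (the grep filter is an assumption, not something from the reporter's cluster), the kubelet log can be narrowed to device-plugin related entries:

# filter kubelet log entries mentioning the device plugin (illustrative filter)
sudo journalctl -r -u kubelet | grep -i "device plugin"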
Additional information that might help better understand your environment and reproduce the bug:
docker version
Version: 20.10.6
API version: 1.41

uname -a
Ubuntu 18.04.1

nvidia-container-cli -V
version: 1.3.3