NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Support automatic discovery of MIG devices #992

Open DrAuYueng opened 1 week ago

DrAuYueng commented 1 week ago

We are using k8s-device-plugin in our Kubernetes cluster and found that, in MIG mode:

  1. The device plugin instance corresponding to a newly created GPU instance (GI) is not started
  2. The status of a newly created compute instance (CI) on the node is not displayed

When we delete the k8s-device-plugin Pod so that it is recreated, the resources are displayed correctly. It seems that newly created MIG resources are not discovered automatically.

klueska commented 1 week ago

That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
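
For anyone doing the restart by hand, here is a minimal sketch of what that could look like when scripted. It only wraps `kubectl delete pod` for the plugin pod on one node, so that the DaemonSet recreates it. The function names are hypothetical, and the namespace (`kube-system`) and label selector (`name=nvidia-device-plugin-ds`) are assumptions based on the static manifest; a Helm or GPU operator install will likely use different values.

```python
import subprocess


def restart_plugin_cmd(node, namespace="kube-system",
                       selector="name=nvidia-device-plugin-ds"):
    """Build a kubectl command that deletes the device-plugin pod on one node.

    The default namespace and label selector are assumptions; adjust them
    to match how the plugin is deployed in your cluster.
    """
    return [
        "kubectl", "-n", namespace, "delete", "pod",
        "-l", selector,
        "--field-selector", f"spec.nodeName={node}",
    ]


def restart_plugin(node, **kwargs):
    """Run the delete command; the DaemonSet controller recreates the pod."""
    # The recreated pod enumerates MIG devices again when it starts.
    subprocess.run(restart_plugin_cmd(node, **kwargs), check=True)
```

Calling `restart_plugin("gpu-node-1")` after changing the MIG layout on `gpu-node-1` (a placeholder node name) would trigger the same rebuild described in the original report.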

If you use the GPU operator, this process is automated for you by a component called the mig-manager, so you don't have to manage this complexity yourself.

Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles
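
As a rough illustration of the label-based flow, the sketch below builds the `kubectl label` call that requests a new MIG layout. It assumes the label key is `nvidia.com/mig.config`, as used in the linked GPU operator documentation, and that the profile name (for example `all-1g.5gb`) exists in your mig-manager configuration. The function names are hypothetical.

```python
import subprocess

# Label key watched by the mig-manager (assumption: taken from the GPU operator docs).
MIG_CONFIG_LABEL = "nvidia.com/mig.config"


def mig_label_cmd(node, profile):
    """Build the kubectl command that requests a MIG profile on a node."""
    return [
        "kubectl", "label", "node", node,
        f"{MIG_CONFIG_LABEL}={profile}",
        "--overwrite",  # replace any previously requested profile
    ]


def apply_mig_profile(node, profile):
    """Set the label; the mig-manager is expected to react to the change."""
    subprocess.run(mig_label_cmd(node, profile), check=True)
```

With the GPU operator installed, `apply_mig_profile("gpu-node-1", "all-1g.5gb")` would be the only manual step; the reconfiguration and the device-plugin restart are handled by the operator.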