Open DrAuYueng opened 1 week ago
That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
If you use the GPU operator, this process is automated for you by a component called the mig-manager, so that you don't have to manage this complexity yourself.
Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles
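As a rough illustration of the node-label approach, here is a minimal Python sketch that only builds the `kubectl label` command the linked docs describe. The label key `nvidia.com/mig.config` comes from the GPU operator documentation; the node name `gpu-node-1` and the profile name `all-1g.5gb` are placeholder assumptions and must match your own node and a profile defined in your mig-manager config.

```python
import shlex

# Label key watched by the mig-manager, per the GPU operator MIG docs.
MIG_CONFIG_LABEL = "nvidia.com/mig.config"


def build_mig_label_command(node_name: str, mig_config: str) -> list[str]:
    """Return the kubectl argv that requests a MIG layout for one node.

    node_name and mig_config are caller-supplied; the values used in the
    demo below are hypothetical examples, not defaults.
    """
    return [
        "kubectl", "label", "nodes", node_name,
        f"{MIG_CONFIG_LABEL}={mig_config}",
        "--overwrite",  # replace any MIG config label already on the node
    ]


if __name__ == "__main__":
    # Hypothetical node and profile names, for illustration only.
    cmd = build_mig_label_command("gpu-node-1", "all-1g.5gb")
    # Print the command instead of running it, so nothing changes on a cluster.
    print(shlex.join(cmd))
```

The sketch prints the command rather than executing it; to apply it you could pass the list to `subprocess.run(cmd, check=True)` on a machine with cluster access.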
We use k8s-device-plugin in our Kubernetes cluster and found that, in MIG mode:
When we delete the k8s-device-plugin Pod so that it is recreated, the resources are displayed correctly. It seems that newly created MIG resources are not discovered automatically.
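For reference, the manual workaround described above (deleting the device-plugin Pod so that it is recreated) could be scripted roughly as follows. This is only a sketch: the Pod name and namespace below are hypothetical placeholders, since they depend on how the plugin was deployed.

```python
import shlex


def build_delete_pod_command(pod_name: str, namespace: str) -> list[str]:
    """Return the kubectl argv that deletes one Pod.

    When the Pod belongs to a DaemonSet, the controller recreates it,
    which restarts the device plugin on that node.
    """
    return ["kubectl", "delete", "pod", pod_name, "-n", namespace]


if __name__ == "__main__":
    # Hypothetical names; look up the real ones with `kubectl get pods -A`.
    cmd = build_delete_pod_command("nvidia-device-plugin-abcde", "kube-system")
    # Printed rather than executed, so the sketch has no side effects.
    print(shlex.join(cmd))
```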