NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia.com/gpu.deploy.mig-manager label not deleted #701

Closed: lengrongfu closed this issue 5 months ago

lengrongfu commented 5 months ago


1. Quick Debug Information

2. Issue or feature description

I want to use this label to detect whether the GPUs on a node currently have MIG enabled.

After I disable MIG on a device, the nvidia.com/gpu.deploy.mig-manager label is still present on the node, and its value is still true.

3. Steps to reproduce the issue

shivamerla commented 5 months ago

@lengrongfu I assume you are applying MIG profiles through the nvidia.com/mig.config label on the node. To disable MIG, set this label to all-disabled; nvidia.com/mig.config.state will then report the result. Use these two labels to track the current status. The nvidia.com/gpu.deploy.mig-manager label only indicates that the GPU Operator will deploy the MIG Manager DaemonSet Pod on the node because it has MIG-capable GPUs. It does not indicate the current MIG state on the node.
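As a rough illustration of the distinction drawn in this comment, here is a minimal Python sketch that derives MIG status from a node's labels and deliberately ignores nvidia.com/gpu.deploy.mig-manager. The helper names `labels_from_node_json` and `mig_status` are hypothetical, and treating a nvidia.com/mig.config.state value of `success` as "the requested config has been applied" is an assumption, not something stated in this thread.

```python
import json

# Label keys named in this thread.
DEPLOY_LABEL = "nvidia.com/gpu.deploy.mig-manager"  # deployment hint only, not MIG state
CONFIG_LABEL = "nvidia.com/mig.config"
STATE_LABEL = "nvidia.com/mig.config.state"


def labels_from_node_json(text):
    """Extract the label map from the JSON of a Node object (hypothetical helper)."""
    return json.loads(text).get("metadata", {}).get("labels", {})


def mig_status(labels):
    """Summarize MIG status from the config and state labels (hypothetical helper)."""
    config = labels.get(CONFIG_LABEL)
    state = labels.get(STATE_LABEL)
    if config is None:
        return "unknown: no nvidia.com/mig.config label"
    # Assumption: "success" means the MIG Manager finished applying the config.
    if state != "success":
        return f"not settled: config={config}, state={state}"
    if config == "all-disabled":
        return "MIG disabled"
    return f"MIG configured: {config}"
```

The input could come from `kubectl get node <node-name> -o json`. Note that DEPLOY_LABEL is never consulted: a node can carry it with the value true while `mig_status` still reports "MIG disabled".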

lengrongfu commented 5 months ago

Thank you for your reply. If the user previously turned on MIG and specified nvidia.com/mig.config, and then turned off the MIG function, will nvidia.com/mig.config be automatically updated to all-disabled?

klueska commented 5 months ago

What do you mean by "and then turned off the MIG function"? The way to turn off MIG is to set nvidia.com/mig.config=all-disabled.

lengrongfu commented 5 months ago

> What do you mean by "and then turned off the MIG function". The way you turn off MIG is to set nvidia.com/mig.config=all-disabled.

I use gpu-operator to enable and disable MIG. When I no longer want to use MIG mode, I just set the migManager.enabled field of the ClusterPolicy CR to false; I don't change the nvidia.com/mig.config label.

klueska commented 5 months ago

Setting migManager.enabled=false will disable the MIG Manager, but it will not disable MIG on the node. You would first need to set nvidia.com/mig.config=all-disabled to trigger the MIG Manager to disable MIG on all the nodes, and only once that has completed, set migManager.enabled=false.

All of that said, what is your reason for setting migManager.enabled=false? The MIG Manager typically remains enabled on all nodes that are "MIG capable", providing an automated way to control how you want MIG configured on those nodes (including disabling MIG on one or all of the GPUs on the node via the nvidia.com/mig.config label).
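The ordering described in this comment (disable MIG through the label first, wait for it to complete, and only then turn off the MIG Manager) could be sketched in Python as follows. This is an illustration under stated assumptions, not the operator's own tooling: the function names are hypothetical, the ClusterPolicy object is assumed to be named `cluster-policy`, the field is assumed to be spelled `spec.migManager.enabled`, and a nvidia.com/mig.config.state value of `success` is assumed to mean the change has completed.

```python
import json


def disable_mig_cmd(node):
    """kubectl argv that requests MIG be disabled on one node via its label."""
    return ["kubectl", "label", "node", node,
            "nvidia.com/mig.config=all-disabled", "--overwrite"]


def mig_fully_disabled(labels_by_node):
    """True only if every node reports all-disabled with state 'success' (assumed value)."""
    return all(
        labels.get("nvidia.com/mig.config") == "all-disabled"
        and labels.get("nvidia.com/mig.config.state") == "success"
        for labels in labels_by_node.values()
    )


def disable_mig_manager_cmd(policy="cluster-policy"):
    """kubectl argv that turns off the MIG Manager in the ClusterPolicy.

    Assumptions: the object name 'cluster-policy' and the field path
    spec.migManager.enabled.
    """
    patch = {"spec": {"migManager": {"enabled": False}}}
    return ["kubectl", "patch", f"clusterpolicies.nvidia.com/{policy}",
            "--type", "merge", "-p", json.dumps(patch)]


def next_commands(labels_by_node):
    """Return the commands for the current step of the two-step sequence."""
    if mig_fully_disabled(labels_by_node):
        # Step 2: MIG is off everywhere, so the manager can be disabled.
        return [disable_mig_manager_cmd()]
    # Step 1: request all-disabled on every node that has not settled yet.
    return [
        disable_mig_cmd(node)
        for node, labels in sorted(labels_by_node.items())
        if labels.get("nvidia.com/mig.config") != "all-disabled"
    ]
```

The functions only build command lines; each one could be passed to `subprocess.run(cmd, check=True)`. An empty list from `next_commands` means all-disabled has been requested everywhere but at least one node has not yet reported completion, so the caller should wait and check again.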

lengrongfu commented 5 months ago

Thank you for your patient reply. I now understand the design here. I was using it the wrong way before, and I can solve my problem by following the steps you described. Thank you again.

klueska commented 5 months ago

No problem. Happy to help.