Closed lengrongfu closed 5 months ago
@lengrongfu I assume you are applying MIG profiles through the nvidia.com/mig.config label on the node. To disable MIG, set this label to all-disabled; nvidia.com/mig.config.state will then indicate the result. These are the labels to use for tracking the current status. The nvidia.com/gpu.deploy.mig-manager label only indicates that GPU Operator will deploy the MIG Manager DaemonSet Pod on the node because it has MIG-capable GPUs. It doesn't indicate the current MIG state on the node.
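As a rough illustration of the distinction above, here is a minimal Python sketch that reads MIG status from a node's labels. The helper name mig_status is hypothetical, and the "success" value for nvidia.com/mig.config.state is an assumption about what the mig-manager reports on completion; check the values on your own nodes.

```python
# Label keys as named in this thread.
MIG_CONFIG = "nvidia.com/mig.config"
MIG_CONFIG_STATE = "nvidia.com/mig.config.state"
DEPLOY_MIG_MANAGER = "nvidia.com/gpu.deploy.mig-manager"


def mig_status(labels):
    """Summarize MIG status from a dict of node labels (hypothetical helper).

    Only mig.config and mig.config.state are consulted. The
    gpu.deploy.mig-manager label is deliberately ignored: it only says
    the node has MIG-capable GPUs, not what state they are in.
    """
    config = labels.get(MIG_CONFIG)
    state = labels.get(MIG_CONFIG_STATE)
    if config is None:
        return "unknown: no MIG config label set"
    if state != "success":  # assumed value reported after a successful apply
        return f"not settled: config={config}, state={state}"
    if config == "all-disabled":
        return "MIG disabled"
    return f"MIG config applied: {config}"
```

The labels dict could come from, for example, kubectl get node <node> -o json under metadata.labels.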
Thank you for your reply.
If the user previously turned on MIG and specified nvidia.com/mig.config, and then turned off the MIG function, will nvidia.com/mig.config be automatically updated to all-disabled?
What do you mean by "and then turned off the MIG function"? The way you turn off MIG is to set nvidia.com/mig.config=all-disabled.
Because I use gpu-operator to manage enabling and disabling MIG: when I no longer use MIG mode, I just update the clusterpolicies CR field migManager.enable: false. I don't change the nvidia.com/mig.config label.
migManager.enable=false will disable the mig manager, but it will not disable MIG on the node. You would need to first set nvidia.com/mig.config=all-disabled to trigger the mig-manager to disable MIG on all the nodes, and only once that has completed, set migManager.enable=false.
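The ordering described above (label first, wait, then disable the manager) could be sketched roughly as follows for a single node. This is only an illustration: the function names are hypothetical, the "success" and "failed" values of nvidia.com/mig.config.state are assumptions, and the sketch waits one poll interval before its first read because a stale state from the previous configuration may still be visible right after relabeling.

```python
import subprocess
import time


def label_cmd(node):
    # Ask the mig-manager to disable MIG on this node.
    return ["kubectl", "label", "node", node,
            "nvidia.com/mig.config=all-disabled", "--overwrite"]


def state_cmd(node):
    # Dots inside the label key must be escaped in kubectl JSONPath.
    return ["kubectl", "get", "node", node, "-o",
            r"jsonpath={.metadata.labels.nvidia\.com/mig\.config\.state}"]


def kubectl(cmd):
    # Run a kubectl command and return its trimmed stdout.
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout.strip()


def disable_mig(node, run=kubectl, timeout_s=600, poll_s=5, sleep=time.sleep):
    """Set all-disabled, then poll mig.config.state until it settles.

    Returns True on (assumed) "success", False on "failed" or timeout.
    """
    run(label_cmd(node))
    waited = 0
    while waited < timeout_s:
        sleep(poll_s)  # wait first so a stale state is less likely to be read
        waited += poll_s
        state = run(state_cmd(node))
        if state == "success":
            return True
        if state == "failed":
            return False
    return False
```

Only after disable_mig returns True for every MIG node would it make sense to set migManager.enable=false in the ClusterPolicy, as described above.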
All of that said, what is the reason for you to set migManager.enable=false? The mig manager typically remains enabled on all nodes that are "MIG capable", providing an automated way to control how you want MIG configured on those nodes (including disabling MIG on one or all of the GPUs on the node) via the nvidia.com/mig.config label.
Thank you for your patient reply. I now understand the design idea here. I was using it the wrong way before. I can solve this problem by following the steps you mentioned. Thank you again.
No problem. Happy to help.
1. Quick Debug Information
2. Issue or feature description
I hope to use this label to detect whether the GPU on the current node has MIG enabled. After I shut down a MIG device, the nvidia.com/gpu.deploy.mig-manager label still exists on the node, and its value is true.
3. Steps to reproduce the issue
nvidia.com/gpu.deploy.mig-manager value.