asm582 opened this issue 5 months ago
@klueska ptal
@mrunalp @asm582 could you clarify what you would expect the behaviour to be?
I have reviewed the code, and we should be generating SOME labels with `mig-strategy=mixed`,
even if all MIG-enabled devices are empty. Which labels are you looking for specifically?
Could you provide the logs of a GFD pod that is not in the running state for this configuration?
They want a new MIG strategy that allows a GPU to exist in MIG mode without having any MIG devices configured on it. Right now we error out if this is the case. The purpose being so that they can dynamically create MIG devices on such GPUs, kick the plugin to restart, and then start advertising those MIG devices as "mixed-strategy-style" resources.
To support this properly, the MIG manager will also need to be updated: it should not persist any MIG configs that get applied, and should only apply a config at the moment it is requested.
Hi @elezar Do you need more details on this?
I chatted with @elezar and I am going to work on this later this week or next.
I was also thinking about my comment above a bit more, and I actually think we don't need any mig-manager changes. To avoid having the mig-manager reapply its "known" configuration after a reboot / restart, you simply have to remove the label.
Meaning that your controller should apply a label to set up a specific config (presumably with some GPUs set to MIG enabled and some set to MIG disabled), wait for the mig-manager to complete, and then remove the label. If no label is set, the mig-manager simply doesn't try and apply any config.
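A minimal sketch of that flow as shell commands (the node name is hypothetical, `all-enabled` is just one example config from this thread, and the polling loop is simplified):

```sh
NODE=worker-gpu-0   # hypothetical node name

# 1. Ask the mig-manager to apply a specific config.
kubectl label node "$NODE" nvidia.com/mig.config=all-enabled --overwrite

# 2. Wait until the mig-manager reports success for this node.
until kubectl get node "$NODE" --show-labels | grep -q 'nvidia.com/mig.config.state=success'; do
  sleep 10
done

# 3. Remove the label so the mig-manager does not re-apply this config
#    after a reboot or restart.
kubectl label node "$NODE" nvidia.com/mig.config-
```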
In the end, I decided to just change the mixed strategy to warn when no MIG devices are configured instead of erroring: https://github.com/NVIDIA/k8s-device-plugin/pull/806
Thank you for creating the fix for the mixed strategy. We are facing issues using the MIG manager: we set the `mig.config` label as `all-enabled` and wait for `nvidia.com/mig.config.state=success`. Then, when we remove the `mig.config` label from the node, the node gets the `nvidia.com/mig.config=all-disabled` label and MIG is disabled on the GPU. We need a setup where the MIG partitions are untouched when the MIG manager pod restarts. @klueska @elezar
The mig-manager itself doesn't apply any "default" config. It only applies a change if a label is set. If no label is set, it will just sit in a wait loop, waiting for one to be set with some config.
@cdesiniotis does the operator force the `mig.config` label to be set to `all-disabled` if it gets unset by an external user?
One way to work around this (and possibly even the "right" solution going forward) would be to deploy the operator with `nvidia.com/mig.config=all-enabled` on the nodes you want configured that way, wait for `nvidia.com/mig.config.state=success` on those nodes, and then disable the mig-manager altogether on those nodes by setting the label `nvidia.com/gpu.deploy.mig-manager=false`.
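Sketched as a command (this follows the same apply-and-wait steps as the earlier sketch; only the final step differs, and the node name is again hypothetical):

```sh
NODE=worker-gpu-0   # hypothetical node name

# Once this node has reached nvidia.com/mig.config.state=success,
# stop deploying the mig-manager on it so no config is ever re-applied:
kubectl label node "$NODE" nvidia.com/gpu.deploy.mig-manager=false --overwrite
```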
> @cdesiniotis does the operator force the `mig.config` label to be set to `all-disabled` if it gets unset by an external user?
Yes, see https://github.com/NVIDIA/gpu-operator/blob/main/controllers/state_manager.go#L538-L546
@asm582 if you set `migManager.config.default=""`, then the operator will not apply a default label. So after you remove the `mig.config` label, the label should remain unset.
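For reference, a sketch of how that might be set at install time, assuming the GPU operator Helm chart and that `migManager.config.default` maps directly to a chart value as written above (release name and namespace are illustrative):

```sh
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set migManager.config.default=""
```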
Let me list some of my findings. Note though that this is OpenShift - not vanilla K8s. The NVIDIA GPU operator is 24.3.
1. With `migManager.config.default=""`, no MIG manager pod is created. So if I want to enable MIG, for instance, I have to explicitly tell the operator to deploy the MIG manager: `kubectl label node $node nvidia.com/gpu.deploy.mig-manager=true --overwrite`. Then the MIG manager starts and I can enable MIG without creating any slices: `kubectl label node $node nvidia.com/mig.config=all-enabled --overwrite` (MIG strategy is `mixed`).
2. With MIG enabled but no slices created, the device plugin pod ends up in `CrashLoopBackOff` and the operator validator pod will keep waiting for initialization. This makes sense, but is a bit annoying. More on this later.
3. I then disable the MIG manager and remove the config label: `kubectl label node $node nvidia.com/gpu.deploy.mig-manager=false --overwrite` and `kubectl label node $node nvidia.com/mig.config-`.
4. I also disable the operator validator: `kubectl label node $node nvidia.com/gpu.deploy.operator-validator=false --overwrite`. This removes at least the operator validator pod. However, if the operator validator is disabled before the device plugin has a chance to start, the plugin will never run. This should be kept in mind.
5. I created a workload pod requesting `nvidia.com/mig-1g.5gb: 1` and created a MIG slice manually to satisfy that, using `nvidia-smi`. Just for testing. I had to delete the device plugin pod and let it be re-created, so that it picks up the MIG changes. The workload pod had remained pending until the device plugin advertised the `nvidia.com/mig-1g.5gb` capacity. After that the workload pod (vectoradd) ran successfully.
6. Finally, I tried to delete the MIG slice (`nvidia-smi mig -dgi -gi 9` or `nvidia-smi mig -dgi`). However, I got this error:
Unable to destroy GPU instance ID 9 from GPU 0: In use by another client
Failed to destroy GPU instances: In use by another client
command terminated with exit code 19
The error persisted no matter what "client" I tried to stop. I disabled DCGM/the DCGM exporter, the validators, even the device plugin, but nothing helped. I believe at this point I can only delete the MIG slice after resetting the GPU.
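For context, step 5 above might look roughly like this. This is only a sketch: the MIG profile name, device plugin namespace/label selector, and container image are assumptions, not taken from the thread:

```sh
# Create a 1g.5gb GPU instance (plus its default compute instance) on GPU 0.
nvidia-smi mig -i 0 -cgi 1g.5gb -C

# Restart the device plugin so it rediscovers the new MIG device
# (namespace and label selector are assumptions about the deployment).
kubectl delete pod -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Run a test workload that requests the advertised MIG resource.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-mig-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1  # illustrative image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
EOF
```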
Thanks @empovit. @klueska @cdesiniotis any pointers that can help us with step 6 of the above comment?
You cannot delete a `gi` until you have deleted its `ci`.
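A minimal sketch of that ordering (instance IDs vary, so list them first):

```sh
nvidia-smi mig -lci   # list compute instances (CIs)
nvidia-smi mig -dci   # destroy the compute instance(s) first
nvidia-smi mig -lgi   # list GPU instances (GIs)
nvidia-smi mig -dgi   # now the GPU instance(s) can be destroyed
```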
@klueska thanks! You're absolutely right. So one problem less :) And a major one.
Everything except 4 seems to be expected (if not ideal).
@cdesiniotis @tariq1890 do you know why (4) might be happening?
Actually, if I recall correctly the validator pod validates "everything", not just CUDA workloads. So I'm guessing the plugin is waiting for the toolkit to be validated, which never happens if the validator pod doesn't run it.
Correct, the device-plugin does not run until the toolkit installation is validated by the validator pod.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
2. Issue or feature description
The mixed strategy requires at least one MIG slice to be available on the device for the k8s-device-plugin and GPU-feature-discovery pods to be in the Running state. In our use case, we want to advertise dynamically created MIG slices, and there will be scenarios in our setup where GPU devices are MIG-enabled but no actual MIG slices are present on the device.
We want a mechanism or new strategy to handle dynamic MIG creation use cases.