NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

New strategy for MIG-enabled GPUs that will advertise no MIG slices if none are created #767

Open asm582 opened 5 months ago

asm582 commented 5 months ago

With the mixed strategy, at least one MIG slice must exist on the device for the k8s-device-plugin and GPU-feature-discovery pods to reach the Running state. In our use case, we want to advertise MIG slices that are created dynamically, and there will be scenarios in our setup where GPUs are MIG-enabled but no MIG slices exist on the device yet.

We want a mechanism or new strategy to handle dynamic MIG creation use cases.

mrunalp commented 5 months ago

@klueska ptal

elezar commented 5 months ago

@mrunalp @asm582 could you clarify what you would expect the behaviour to be?

I have reviewed the code and we should be generating SOME labels if mig-strategy=mixed even if all MIG-enabled devices are empty. Which labels are you looking for specifically?

Could you provide the logs of a GFD pod that is not in the running state for this configuration?

klueska commented 5 months ago

They want a new MIG strategy that allows a GPU to exist in MIG mode without having any MIG devices configured on it. Right now we error out in this case. The purpose is to let them dynamically create MIG devices on such GPUs, kick the plugin to restart, and then start advertising those MIG devices as "mixed-strategy-style" resources.

klueska commented 5 months ago

To support this properly, the MIG manager will also need to be updated to:

  1. Allow one to trigger a MIG mode change via a label (as it does today); BUT
  2. NOT reapply this configuration if/when the MIG manager is restarted

In other words -- don't persist any MIG configs that get applied. Only apply a config at the moment it is requested.

asm582 commented 5 months ago

Hi @elezar Do you need more details on this?

klueska commented 5 months ago

I chatted with @elezar and I am going to work on this later this week or next.

I was also thinking about my comment above a bit more, and I actually think we don't need any mig-manager changes. To avoid having the mig-manager reapply its "known" configuration after a reboot / restart, you simply have to remove the label.

Meaning that your controller should apply a label to set up a specific config (presumably with some GPUs set to MIG enabled and some set to MIG disabled), wait for the mig-manager to complete, and then remove the label. If no label is set, the mig-manager simply doesn't try to apply any config.
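
Concretely, the label lifecycle could look something like this (here $node holds the node name and custom-mig-config is a hypothetical config name standing in for whatever config your controller applies):

    # Ask the mig-manager to apply a specific MIG config on this node
    kubectl label node $node nvidia.com/mig.config=custom-mig-config --overwrite

    # Wait until the mig-manager reports that the config has been applied
    until kubectl get node $node --show-labels | grep -q 'nvidia.com/mig.config.state=success'; do
      sleep 10
    done

    # Remove the label so the config is not reapplied if the mig-manager restarts
    kubectl label node $node nvidia.com/mig.config-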

klueska commented 4 months ago

In the end, I decided to just change the mixed strategy to warn when no MIG devices are configured instead of erroring: https://github.com/NVIDIA/k8s-device-plugin/pull/806

asm582 commented 4 months ago

Thank you for creating the fix for the mixed strategy. We are facing issues using the MIG manager:

We need a setup where the MIG partitions are untouched when the MIG manager pod restarts. @klueska @elezar

klueska commented 4 months ago

The mig-manager itself doesn't apply any "default" config. It only applies a change if a label is set. If no label is set, it will just sit in a wait loop, waiting for one to be set with some config.

@cdesiniotis does the operator force the mig.config label to be set to all-disabled if it gets unset by an external user?

klueska commented 4 months ago

One way to work around this (and possibly even the "right" solution going forward) would be to deploy the operator with nvidia.com/mig.config=all-enabled on the nodes you want configured that way, wait for nvidia.com/mig.config.state=success on those nodes, and then disable the mig-manager altogether on those nodes by setting the label nvidia.com/gpu.deploy.mig-manager=false.
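
Roughly (node name in $node; wait for the success label however your tooling prefers):

    # Let the mig-manager put the node's GPUs into MIG mode
    kubectl label node $node nvidia.com/mig.config=all-enabled --overwrite

    # Wait for nvidia.com/mig.config.state=success on the node, then
    # stop deploying the mig-manager on this node altogether
    kubectl label node $node nvidia.com/gpu.deploy.mig-manager=false --overwrite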

cdesiniotis commented 4 months ago

@cdesiniotis does the operator force the mig.config label to be set to all-disabled if it gets unset by an external user?

Yes, see https://github.com/NVIDIA/gpu-operator/blob/main/controllers/state_manager.go#L538-L546

cdesiniotis commented 4 months ago

@asm582 if you set migManager.config.default="", then the operator will not apply a default label. So after you remove the mig.config label, the label should remain unset.
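
For a Helm-based install, that would look something like the following (release and namespace names are placeholders; on OpenShift the equivalent field is set in the ClusterPolicy instead):

    # Install/upgrade the operator without a default MIG config label
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --set migManager.config.default=""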

empovit commented 4 months ago

Let me list some of my findings. Note, though, that this is OpenShift, not vanilla Kubernetes. The NVIDIA GPU Operator version is 24.3.

  1. When the cluster policy is deployed with migManager.config.default="", no MIG manager pod is created. So if I want to enable MIG, for instance, I have to explicitly tell the operator to deploy the MIG manager: kubectl label node $node nvidia.com/gpu.deploy.mig-manager=true --overwrite. Then the MIG manager starts and I can enable MIG without creating any slices: kubectl label node $node nvidia.com/mig.config=all-enabled --overwrite (the MIG strategy is mixed).
  2. When there are no MIG slices while MIG is enabled on the GPU, the CUDA validator pod will CrashLoopBackOff and the operator validator pod will keep waiting for initialization. This makes sense, but is a bit annoying. More on this later.
  3. Now, when MIG is enabled through the MIG manager, we can disable the manager as suggested, so that it doesn't interfere with other ways to manage MIG slices. It works.
    kubectl label node $node nvidia.com/gpu.deploy.mig-manager=false --overwrite
    kubectl label node $node nvidia.com/mig.config-
  4. I tried to disable the validator to get rid of the error status: kubectl label node $node nvidia.com/gpu.deploy.operator-validator=false --overwrite. This removes at least the operator validator pod. However, if the operator validator is disabled before the device plugin has a chance to start, the plugin will never run. This should be kept in mind.
  5. I deployed a workload pod that requested nvidia.com/mig-1g.5gb: 1 and, just for testing, created a MIG slice manually with nvidia-smi to satisfy it (see the sketch after this list). I had to delete the device plugin pod and let it be re-created so that it picks up the MIG changes. The workload pod remained Pending until the device plugin advertised the nvidia.com/mig-1g.5gb capacity; after that the workload pod (vectoradd) ran successfully.
  6. After deleting the workload pod, and continuing the manual testing, I tried to delete the MIG partition (nvidia-smi mig -dgi -gi 9 or nvidia-smi mig -dgi). However, I got this error:
    Unable to destroy GPU instance ID  9 from GPU  0: In use by another client
    Failed to destroy GPU instances: In use by another client
    command terminated with exit code 19

    The error persisted no matter which "client" I tried to stop. I disabled DCGM/the DCGM exporter, the validators, even the device plugin, but nothing helped. I believe at this point I can only reset the GPU before I can delete the MIG slice.
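
For reference, the manual slicing and plugin restart in step 5 amounted to roughly the following (profile ID 19 is 1g.5gb on an A100 and may differ on other GPUs; the namespace and pod label correspond to a typical GPU operator install and may also differ):

    # List the available GPU instance profiles and their IDs
    nvidia-smi mig -lgip

    # Create a 1g.5gb GPU instance plus its default compute instance
    nvidia-smi mig -cgi 19 -C

    # Restart the device plugin so it picks up the new MIG device
    kubectl delete pod -n nvidia-gpu-operator -l app=nvidia-device-plugin-daemonset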

asm582 commented 4 months ago

Thanks @empovit. @klueska @cdesiniotis any pointers that can help us with step 6 of the above comment?

klueska commented 4 months ago

You cannot delete a GPU instance (gi) until you have deleted its compute instance (ci).
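
In other words, something along these lines, using the GPU instance ID 9 from the comment above:

    # List compute instances to see which CI lives inside GPU instance 9
    nvidia-smi mig -lci

    # Destroy the compute instance(s) of GPU instance 9 first ...
    nvidia-smi mig -dci -gi 9

    # ... then the GPU instance itself can be destroyed
    nvidia-smi mig -dgi -gi 9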

empovit commented 4 months ago

@klueska thanks! You're absolutely right. So one problem less :) And a major one.

klueska commented 4 months ago

Everything except 4 seems to be expected (if not ideal).

@cdesiniotis @tariq1890 do you know why (4) might be happening?

klueska commented 4 months ago

Actually, if I recall correctly, the validator pod validates "everything", not just CUDA workloads. So I'm guessing the plugin is waiting for the toolkit to be validated, which never happens if the validator pod doesn't run.

cdesiniotis commented 3 months ago

Correct, the device-plugin does not run until the toolkit installation is validated by the validator pod.

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.