sozercan closed this issue 5 hours ago
I'm also facing the same problem.
This is the result of a bug that would be fixed by: https://github.com/NVIDIA/k8s-dra-driver/pull/123
That said, the current code-base is slated to be rewritten to conform to the DRA APIs introduced in the latest Kubernetes 1.31 release.
This release does not support dynamic MIG. That feature will be reintroduced once Kubernetes 1.32 comes out in December.
@klueska One problem I've faced is that even though the MIGs are available as shown above, the kind guide at https://github.com/NVIDIA/k8s-dra-driver doesn't work on A100 GPUs. I have 2 A100 GPUs with 40GB each, and after I create the MIGs, the capacity section on the kind nodes doesn't show the available GPUs. Do you have any idea why?
What do you mean by "when you create them"? There must be no MIGs precreated on the GPUs when the driver comes online (otherwise you will get the error above). The driver will create the MIG devices on the fly based on incoming requests for them.
Yes, I realised that.
However, for some reason my pods are stuck in a pending state and the resource claims are stuck in:
root@wild-wind-3603273:~/k8s-dra-driver# k get resourceclaims -A
NAMESPACE NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
gpu-test1 pod1-gpu-bcwkk gpu.nvidia.com WaitForFirstConsumer pending 3m45s
gpu-test1 pod2-gpu-64m6c gpu.nvidia.com WaitForFirstConsumer pending 3m45s
gpu-test2 pod-shared-gpu-q9rlb gpu.nvidia.com WaitForFirstConsumer pending 3m45s
gpu-test3 shared-gpu gpu.nvidia.com WaitForFirstConsumer pending 3m45s
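In case it helps anyone debugging the same thing, here is how I've been poking at the stuck claims (the claim name and namespace are from my cluster above; the driver namespace and label selector are guesses based on a default install, so adjust to yours):

```shell
# Inspect a stuck claim for its status and allocation conditions
kubectl -n gpu-test1 describe resourceclaim pod1-gpu-bcwkk

# Check recent events for scheduler / driver errors in that namespace
kubectl -n gpu-test1 get events --sort-by=.lastTimestamp

# Tail the DRA driver's kubelet plugin logs on the worker
# (namespace and label are assumptions -- match them to your deployment)
kubectl -n nvidia-dra-driver logs -l app.kubernetes.io/name=k8s-dra-driver --all-containers
```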
The Node configuration from the demo results in something like this, which is missing the GPU as capacity. I think that's what is causing the problem, WDYT?
Addresses:
InternalIP: 172.18.0.3
Hostname: k8s-dra-driver-cluster-worker
Capacity:
cpu: 16
ephemeral-storage: 383948072Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 125780508Ki
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 383948072Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 125780508Ki
pods: 110
For some reason, the GPUs are not available on my kind nodes; I think that's needed for the driver to create the MIGs.
DRA does not populate that because resources are not exposed via the extended resource APIs.
I'm not sure what is causing your pods to remain pending (without looking more deeply), but the lack of GPU resources visible in capacity/allocatable is expected.
Is there a Slack channel where I can DM you? I don't mind sharing the VM creds (just so you can have a quick look if you want to). I've spent hours trying to figure out the problem, but everything seems to be fine.
root@wild-wind-3603273:~# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-951c3156-dbe7-8da7-f8f1-a1c3ff7955f6)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-fde15cf9-5d71-15af-1ca6-50934b08831d)
The code has been updated to adhere to the latest DRA APIs in Kubernetes v1.31.
This is a major overhaul of the code base, including the removal of CRDs from which MIG devices are synced (resources are now advertised directly to the in-tree scheduler for allocation rather than allocated by a custom controller).
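For anyone wanting to see what the rewritten driver is advertising, the structured-parameters model in Kubernetes v1.31 publishes devices to the scheduler as ResourceSlice objects (API group `resource.k8s.io/v1alpha3` in 1.31), rather than via the old CRDs. A quick way to check, assuming the DRA feature gates and API group are enabled on your cluster:

```shell
# List the devices each node's DRA driver has published to the scheduler
kubectl get resourceslices

# Dump the full device attributes (e.g. MIG profiles, memory, UUIDs)
kubectl get resourceslices -o yaml
```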
Please try updating to the latest and let me know if you still have issues.
One thing to note is that in Kubernetes v1.31 dynamic MIG is not supported. You can pre-create a set of MIG devices (with mig-parted or the mig-manager in the GPU operator) and they will be available for allocation, but you can't have their creation triggered dynamically anymore. We hope to bring this support back in Kubernetes v1.32.
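For reference, a minimal way to pre-create MIG devices by hand with nvidia-smi (the profile choice below is just an example for an A100 40GB; mig-parted or the GPU operator is the more robust route):

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset to take effect)
sudo nvidia-smi -i 0 -mig 1

# Create two 3g.20gb GPU instances plus their default compute instances
# (profile ID 9 = 3g.20gb on an A100 40GB; adjust for your partitioning)
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify the resulting MIG devices are visible
nvidia-smi -L
```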
Closing for now -- please reopen if you would like to discuss further / still have issues.
Nice, I'll try it out on my A100 cluster and see if it works... Thanks for the heads up Kevin!!
The README here may be of particular interest to you as it shows how to both partition the GPUs to begin with as well as demo how to request them: https://github.com/NVIDIA/k8s-dra-driver/tree/main/demo/specs/quickstart
After creating MIG instances, the kubelet plugin fails with:
Error: unable to sync prepared devices from CRD: MIG devices found that aren't prepared to any claim
The kubelet plugin works if I don't set up MIG instances.
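In case anyone else hits this with the pre-rewrite driver (which expects no pre-created MIGs, per the comment above): one workaround, assuming nothing is currently using the GPUs, is to tear down the pre-created MIG devices before the plugin starts:

```shell
# Destroy compute instances first, then GPU instances, on all GPUs
sudo nvidia-smi mig -dci
sudo nvidia-smi mig -dgi

# Optionally disable MIG mode on the GPU entirely
sudo nvidia-smi -i 0 -mig 0
```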
repro:
nvidia-smi
output: