NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Error: unable to sync prepared devices from CRD: MIG devices found that aren't prepared to any claim #149

Closed sozercan closed 5 hours ago

sozercan commented 1 month ago

After creating MIG instances, the kubelet plugin fails with `Error: unable to sync prepared devices from CRD: MIG devices found that aren't prepared to any claim`.

The kubelet plugin works if I don't set up MIG instances.

repro:

sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 
sudo nvidia-smi mig -cci
# this is from the `main` branch 
./demo/clusters/kind/create-cluster.sh
...
./demo/clusters/kind/build-dra-driver.sh
...
./demo/clusters/kind/install-dra-driver.sh
kubectl get pods -A
nvidia-dra-driver    nvidia-k8s-dra-driver-controller-6d5869d478-tpg7q              1/1     Running   0               8m59s
nvidia-dra-driver    nvidia-k8s-dra-driver-kubelet-plugin-8wmmz                     0/1     CrashLoopBackOff   4 (65s ago)     2m35s
kubectl logs -n nvidia-dra-driver nvidia-k8s-dra-driver-kubelet-plugin-8wmmz
Defaulted container "plugin" out of: plugin, init (init)
I0728 22:15:58.289802       1 device_state.go:146] using devRoot=/driver-root
Error: unable to sync prepared devices from CRD: MIG devices found that aren't prepared to any claim: map[GPU-fb94dd02-0d83-9a84-fc28-4f82083afef0:map[MIG-14f88cdc-5b6d-521d-ac11-75b761535917:0xc00068a4b0 MIG-15ac69e8-9597-5c97-a412-9136dc81cd5f:0xc00068a3f0 MIG-22a791bd-3caa-5f15-b652-7d380a11a28a:0xc00068a2d0 MIG-37869b9c-5b2f-51f4-8ed7-41ad74e2d4ee:0xc00068a390 MIG-4c221d9d-fff9-5493-aa05-3c5127484f6b:0xc00068a330 MIG-6bd870c3-ff58-5928-b591-8c4b546f7bbc:0xc00068a270 MIG-7f84882a-373b-548b-aab1-76d900f05615:0xc00068a450]]

nvidia-smi output:

$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS                       NAMES
ca08963006f6   kindest/node:v1.29.1   "/usr/local/bin/entr…"   58 minutes ago   Up 57 minutes   127.0.0.1:36171->6443/tcp   k8s-dra-driver-cluster-control-plane
3b2538820a3d   kindest/node:v1.29.1   "/usr/local/bin/entr…"   58 minutes ago   Up 57 minutes                               k8s-dra-driver-cluster-worker

$ docker exec -it 3b2538820a3d bash
root@k8s-dra-driver-cluster-worker:/# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000001:00:00.0 Off |                   On |
| N/A   31C    P0             43W /  300W |      88MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    7   0   0  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    8   0   1  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   2  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   11   0   3  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   12   0   4  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   5  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   14   0   6  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
shivanshuraj1333 commented 2 weeks ago

I'm also facing the same problem.

klueska commented 2 weeks ago

This is the result of a bug that would be fixed by: https://github.com/NVIDIA/k8s-dra-driver/pull/123

That said, the current code-base is slated to be rewritten to conform to the DRA APIs introduced in the latest Kubernetes 1.31 release.

That release will not support dynamic MIG. This feature will be reintroduced once Kubernetes 1.32 comes out in December.

shivanshuraj1333 commented 2 weeks ago

@klueska One problem I have faced: even though the MIG devices are available as shown above, the kind guide (https://github.com/NVIDIA/k8s-dra-driver) doesn't work on an A100 GPU. I have 2 A100 GPUs with 40GB each, and after I create the MIG devices, the capacity section on the kind nodes doesn't show the available GPUs. Do you have any idea why?

klueska commented 2 weeks ago

What do you mean by "when you create them"? There must be no MIGs precreated on the GPUs when the driver comes online (otherwise you will get the error above). The driver creates the MIG devices on the fly based on incoming requests for them.
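In that case, one way to get back to a clean state before (re)starting the driver is to tear down any pre-created MIG devices on the host. A sketch using standard `nvidia-smi mig` flags (order matters: compute instances must be destroyed before their parent GPU instances):

```shell
# Destroy all compute instances, then all GPU instances,
# so the driver comes online with no pre-created MIG devices.
sudo nvidia-smi mig -dci   # destroy compute instances
sudo nvidia-smi mig -dgi   # destroy GPU instances
sudo nvidia-smi mig -lgi   # verify: should list no GPU instances
```

If other processes hold the GPU, these commands will refuse to destroy the instances until those processes exit.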

shivanshuraj1333 commented 2 weeks ago

> What do you mean by "when you create them"? There must be no MIGs precreated on the GPUs when the driver comes online (otherwise you will get the error above). The driver will create the MIG devices on the fly based on incoming requests for them.

Yes, I realised that.

However, for some reason my pods are stuck in the pending state, and the resource claims are stuck in:

root@wild-wind-3603273:~/k8s-dra-driver# k get resourceclaims -A
NAMESPACE   NAME                   RESOURCECLASSNAME   ALLOCATIONMODE         STATE     AGE
gpu-test1   pod1-gpu-bcwkk         gpu.nvidia.com      WaitForFirstConsumer   pending   3m45s
gpu-test1   pod2-gpu-64m6c         gpu.nvidia.com      WaitForFirstConsumer   pending   3m45s
gpu-test2   pod-shared-gpu-q9rlb   gpu.nvidia.com      WaitForFirstConsumer   pending   3m45s
gpu-test3   shared-gpu             gpu.nvidia.com      WaitForFirstConsumer   pending   3m45s
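To see why a claim stays pending, describing the claim and its consuming pod usually surfaces the relevant scheduler/driver events. A generic sketch (claim names taken from the output above; the pod name `pod1` is a guess at the demo's naming):

```shell
# Inspect one of the pending claims and the pod that references it
kubectl describe resourceclaim -n gpu-test1 pod1-gpu-bcwkk
kubectl describe pod -n gpu-test1 pod1

# Recent events across all namespaces, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -20
```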

The node configuration from the demo results in something like this, which is missing the GPUs under capacity; I think that's what is causing the problem. WDYT?

Addresses:
  InternalIP:  172.18.0.3
  Hostname:    k8s-dra-driver-cluster-worker
Capacity:
  cpu:                16
  ephemeral-storage:  383948072Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             125780508Ki
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  383948072Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             125780508Ki
  pods:               110

For some reason, the GPUs are not available on my kind nodes; I think that's needed for the driver to create the MIG devices.

klueska commented 2 weeks ago

DRA does not populate those fields, because DRA resources are not exposed via the extended resource API.

I'm not sure what is causing your pods to remain pending (without looking more deeply), but the lack of GPU resources visible in capacity/allocatable is expected.

shivanshuraj1333 commented 2 weeks ago

Is there a Slack channel where I can DM you? I wouldn't mind sharing the VM creds (just so you can have a quick look if you want to). I've spent hours trying to figure out the problem, but everything seems to be fine.

root@wild-wind-3603273:~# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-951c3156-dbe7-8da7-f8f1-a1c3ff7955f6)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-fde15cf9-5d71-15af-1ca6-50934b08831d)
klueska commented 5 hours ago

The code has been updated to adhere to the latest DRA APIs in Kubernetes v1.31.

This is a major overhaul of the code base, including the removal of CRDs from which MIG devices are synced (resources are now advertised directly to the in-tree scheduler for allocation rather than allocated by a custom controller).

Please try updating to the latest and let me know if you still have issues.

One thing to note is that in Kubernetes v1.31 dynamic MIG is not supported. You can pre-create a set of MIG devices (with mig-parted or the mig-manager in the GPU Operator) and they will be available for allocation, but their creation can no longer be triggered dynamically. We hope to bring this support back in Kubernetes v1.32.
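For example, a mig-parted config that pre-creates seven 1g.10gb slices on every GPU might look like this (a sketch of the nvidia-mig-parted config format; the `1g.10gb` profile assumes an 80GB A100, and the config name `all-1g.10gb` is illustrative):

```yaml
version: v1
mig-configs:
  all-1g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
```

Applied with something like `sudo nvidia-mig-parted apply -f config.yaml -c all-1g.10gb` before the driver comes online.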

klueska commented 5 hours ago

Closing for now -- please reopen if you would like to discuss further / still have issues.

shivanshuraj1333 commented 5 hours ago

Nice, I'll try it out on my A100 cluster and see if it works... Thanks for the heads up Kevin!!

klueska commented 4 hours ago

The README here may be of particular interest to you as it shows how to both partition the GPUs to begin with as well as demo how to request them: https://github.com/NVIDIA/k8s-dra-driver/tree/main/demo/specs/quickstart
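As a rough idea of what requesting a device looks like under the v1.31 `resource.k8s.io/v1alpha3` API, a sketch (the device class name `gpu.nvidia.com` and all object names here are assumptions; the quickstart linked above has the authoritative specs):

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com   # assumed class published by the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu            # references the entry in resourceClaims below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```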