NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

DRA driver does not pick up all GPUs on the node #32

Open asm582 opened 9 months ago

asm582 commented 9 months ago

I have enabled MIG mode on both GPUs on a single node, but the NAS object shows that one of the GPUs is not MIG-enabled:

  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    0
      Memory Bytes:             85899345920
      Mig Enabled:              true
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    1
      Memory Bytes:             85899345920
      Mig Enabled:              false
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-713eebac-08df-c534-6c98-8d5055ca97a9

Output of nvidia-smi:

[root@nvd-srv-02 k8s-dra-driver]# nvidia-smi
Thu Nov 30 11:15:31 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:17:00.0 Off |                   On |
| N/A   35C    P0              45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:65:00.0 Off |                   On |
| N/A   35C    P0              46W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |

Can you please share how the NAS object can be updated correctly?

klueska commented 9 months ago

Hmm, this is unexpected. We just pull the MIG state directly from NVML (the underlying library that nvidia-smi uses as well).

Is it possible that the plugin came online when only one GPU had MIG enabled and the other didn't? The plugin doesn't do any real-time reconciliation of the GPU state; the only way to get it to update is to restart the plugin.

So can you try restarting the plugin?

And if that doesn't work, can you try deleting the NAS object and then restarting the plugin? This shouldn't be necessary, but I'm curious whether it resolves the issue.
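The two steps suggested above (restart the plugin, and optionally delete the NAS object first) could be scripted roughly as follows. This is a minimal sketch that only builds and prints the `kubectl` commands. The namespace and node name are taken from later output in this thread, while the daemonset name `nvidia-k8s-dra-driver-kubelet-plugin` is an assumption inferred from the plugin pod name and may differ in a given deployment. The helper `plugin_reset_commands` is hypothetical.

```python
import shlex


def plugin_reset_commands(namespace, node, daemonset, delete_nas=False):
    """Return the kubectl commands (as argv lists) for the suggested recovery steps."""
    commands = []
    if delete_nas:
        # Optional step: remove the existing NodeAllocationState object first.
        commands.append(["kubectl", "delete", f"nas/{node}", "-n", namespace])
    # Restart the plugin so that it reads the GPU state again on startup.
    commands.append(
        ["kubectl", "rollout", "restart", f"daemonset/{daemonset}", "-n", namespace]
    )
    return commands


if __name__ == "__main__":
    for argv in plugin_reset_commands(
        namespace="nvidia-dra-driver",
        node="k8s-dra-driver-cluster-worker",
        daemonset="nvidia-k8s-dra-driver-kubelet-plugin",  # assumed name
        delete_nas=True,
    ):
        # Print only; review the commands before running them by hand.
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; the commands can be pasted into a shell once the names have been checked against `kubectl get daemonset -n nvidia-dra-driver`.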

asm582 commented 9 months ago

Thanks, I will delete and recreate the cluster, then reinstall the DRA driver.

asm582 commented 9 months ago

Update: I recreated the KinD cluster and redeployed the previously built driver image, but still no luck:

[root@nvd-srv-02 k8s-dra-driver]# kubectl get pods -n nvidia-dra-driver
NAME                                               READY   STATUS    RESTARTS   AGE
nvidia-k8s-dra-driver-controller-6d6b45756-47khb   1/1     Running   0          2m
nvidia-k8s-dra-driver-kubelet-plugin-fkz4d         1/1     Running   0          2m
Spec:
  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    1
      Memory Bytes:             85899345920
      Mig Enabled:              false
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-713eebac-08df-c534-6c98-8d5055ca97a9
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    0
      Memory Bytes:             85899345920
      Mig Enabled:              true
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb

elezar commented 9 months ago

Could you confirm that running nvidia-smi in the kind worker node shows MIG as enabled?

asm582 commented 9 months ago

Thanks, could you please recommend the container image that I should use to run the command?

elezar commented 9 months ago

Running:

$ docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED       STATUS       PORTS                       NAMES
0141a7534ebf   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   3 weeks ago   Up 3 weeks   127.0.0.1:44521->6443/tcp   k8s-dra-driver-cluster-control-plane
255a4db134af   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   3 weeks ago   Up 3 weeks                               k8s-dra-driver-cluster-worker

shows the kind nodes created by the demo.

Running:

$  docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi

is equivalent to running nvidia-smi on a k8s node (in this case, the containerized kind worker node).
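For a check that is easier to compare programmatically than the full nvidia-smi table, the per-GPU MIG mode can be queried in CSV form from inside the worker container. The sketch below assumes that the `mig.mode.current` query field is available in the installed driver and reports `Enabled` or `Disabled`; the helper names `parse_mig_modes` and `mig_modes_in_container` are hypothetical.

```python
import subprocess

# Fields passed to `nvidia-smi --query-gpu`; mig.mode.current is assumed available.
QUERY = "index,uuid,mig.mode.current"


def parse_mig_modes(csv_text):
    """Map GPU UUID -> True/False from `--format=csv,noheader` output."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _index, uuid, mode = [field.strip() for field in line.split(",")]
        modes[uuid] = mode == "Enabled"
    return modes


def mig_modes_in_container(container):
    """Run nvidia-smi inside a running container and parse its CSV output."""
    result = subprocess.run(
        ["docker", "exec", container, "nvidia-smi",
         f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_mig_modes(result.stdout)
```

Calling `mig_modes_in_container("k8s-dra-driver-cluster-worker")` on the host would return one entry per GPU visible inside the kind worker node.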

asm582 commented 9 months ago

Thank you for sharing the command. Below is the output:

[root@nvd-srv-02 k8s-dra-driver]# docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED              STATUS              PORTS                       NAMES
b5609ebd1675   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   About a minute ago   Up About a minute   127.0.0.1:34917->6443/tcp   k8s-dra-driver-cluster-control-plane
5ef8b180a289   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   About a minute ago   Up About a minute                               k8s-dra-driver-cluster-worker
[root@nvd-srv-02 k8s-dra-driver]# docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi
Mon Dec  4 14:29:52 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:17:00.0 Off |                   On |
| N/A   36C    P0              45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:65:00.0 Off |                   On |
| N/A   35C    P0              46W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[root@nvd-srv-02 k8s-dra-driver]# kubectl get nodes
NAME                                   STATUS   ROLES           AGE     VERSION
k8s-dra-driver-cluster-control-plane   Ready    control-plane   2m36s   v1.27.1
k8s-dra-driver-cluster-worker          Ready    <none>          2m12s   v1.27.1
[root@nvd-srv-02 k8s-dra-driver]# kubectl describe nas/k8s-dra-driver-cluster-worker -n nvidia-dra-driver
Name:         k8s-dra-driver-cluster-worker
Namespace:    nvidia-dra-driver
Labels:       <none>
Annotations:  <none>
API Version:  nas.gpu.resource.nvidia.com/v1alpha1
Kind:         NodeAllocationState
Metadata:
  Creation Timestamp:  2023-12-04T14:29:09Z
  Generation:          4
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            k8s-dra-driver-cluster-worker
    UID:             ddb095d1-a608-4f70-a7b2-bc55ad81ed4c
  Resource Version:  587
  UID:               863efe97-f965-4f42-9816-88e5fc3bb860
Spec:
  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    0
      Memory Bytes:             85899345920
      Mig Enabled:              true
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    1
      Memory Bytes:             85899345920
      Mig Enabled:              false
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-713eebac-08df-c534-6c98-8d5055ca97a9
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   1
        Start:  0
        Size:   1
        Start:  1
        Size:   1
        Start:  2
        Size:   1
        Start:  3
        Size:   1
        Start:  4
        Size:   1
        Start:  5
        Size:   1
        Start:  6
      Profile:  1g.10gb+me
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   2
        Start:  0
        Size:   2
        Start:  2
        Size:   2
        Start:  4
        Size:   2
        Start:  6
      Profile:  1g.20gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   1
        Start:  0
        Size:   1
        Start:  1
        Size:   1
        Start:  2
        Size:   1
        Start:  3
        Size:   1
        Start:  4
        Size:   1
        Start:  5
        Size:   1
        Start:  6
      Profile:  1g.10gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   2
        Start:  0
        Size:   2
        Start:  2
        Size:   2
        Start:  4
      Profile:  2g.20gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   4
        Start:  0
        Size:   4
        Start:  4
      Profile:  3g.40gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   4
        Start:  0
      Profile:  4g.40gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   8
        Start:  0
      Profile:  7g.80gb
Status:         Ready
Events:         <none>

As seen above, nvidia-smi and the NAS object do not agree.

One thing to note is that docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi takes a long time to execute, about 12 seconds.
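The disagreement between the two views can be pinned down per GPU by comparing UUIDs. The following is a rough sketch that parses the text printed by `kubectl describe nas/...` (relying on `Mig Enabled:` appearing before `Uuid:` within each `Gpu:` block, as in the output above) and compares it with a UUID-to-state mapping obtained from nvidia-smi. The function names are hypothetical, and this is not part of the driver.

```python
def parse_nas_gpus(describe_text):
    """Map GPU UUID -> 'Mig Enabled' value from `kubectl describe nas/...` text."""
    gpus = {}
    in_gpu = False
    mig_enabled = None
    for raw in describe_text.splitlines():
        line = raw.strip()
        if line == "Gpu:":
            # Start of a full-GPU entry; reset the per-entry state.
            in_gpu, mig_enabled = True, None
        elif line == "Mig:":
            # MIG profile entries carry no per-GPU state; ignore them.
            in_gpu = False
        elif in_gpu and line.startswith("Mig Enabled:"):
            mig_enabled = line.split(":", 1)[1].strip() == "true"
        elif in_gpu and line.startswith("Uuid:"):
            # Uuid is the last field of the block in the describe output.
            gpus[line.split(":", 1)[1].strip()] = mig_enabled
    return gpus


def find_mismatches(nas_gpus, smi_gpus):
    """Return the UUIDs whose MIG state differs between the two views."""
    return sorted(
        uuid for uuid in nas_gpus
        if uuid in smi_gpus and nas_gpus[uuid] != smi_gpus[uuid]
    )
```

With the values reported in this thread (both GPUs enabled according to nvidia-smi, one disabled according to the NAS object), `find_mismatches` would return only the UUID of GPU 1.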

klueska commented 2 weeks ago

Can this be closed?