NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Allocation mode Immediate does not work #20

Closed: asm582 closed this issue 9 months ago

asm582 commented 11 months ago

We are trying to use allocation mode Immediate, but it does not work. We see the claims created:

Name:         gpu.example.com
Namespace:    gpu-test1
Labels:       <none>
Annotations:  <none>
API Version:  resource.k8s.io/v1alpha2
Kind:         ResourceClaimTemplate
Metadata:
  Creation Timestamp:  2023-11-16T16:34:00Z
  Resource Version:    4614
  UID:                 0d57888b-bdce-4633-8326-dc81898f6f43
Spec:
  Metadata:
    Creation Timestamp:  <nil>
  Spec:
    Allocation Mode:      Immediate
    Resource Class Name:  gpu.example.com

but claims are not generated on the node:

Name:         dra-example-driver-cluster-worker
Namespace:    dra-example-driver
Labels:       <none>
Annotations:  <none>
API Version:  nas.gpu.resource.example.com/v1alpha1
Kind:         NodeAllocationState
Metadata:
  Creation Timestamp:  2023-11-16T15:54:37Z
  Generation:          79
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            dra-example-driver-cluster-worker
    UID:             0d210c9c-da1b-4fad-afca-e7369d6a5851
  Resource Version:  15633
  UID:               1fd486d6-e754-47dc-bb4c-0392f61b3c05
Spec:
  Allocatable Devices:
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-e7b42cb1-4fd8-91b2-bc77-352a0c1f5747
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-f11773a1-5bfb-e48b-3d98-1beb5baaf08e
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-0159f35e-99ee-b2b5-74f1-9d18df3f22ac
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-657bd2e7-f5c2-a7f2-fbaa-0d1cdc32f81b
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-18db0e85-99e9-c746-8531-ffeb86328b39
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-93d37703-997c-c46f-a531-755e3e0dc2ac
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-ee3e4b55-fcda-44b8-0605-64b7a9967744
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-9ede7e32-5825-a11b-fa3d-bab6d47e0243
Status:              Ready
Events:              <none>

klueska commented 11 months ago

The resource class is wrong: it is gpu.example.com, but it should be gpu.nvidia.com.

asm582 commented 11 months ago

Thanks @klueska. As shown above, I am running the example driver with simulated GPUs. Are you saying Immediate mode only works with real GPUs?

elezar commented 11 months ago

Are you referring to https://github.com/kubernetes-sigs/dra-example-driver? If so, we should migrate this issue there instead. This repository is for the NVIDIA GPU-specific DRA driver implementation.

klueska commented 11 months ago

Support for Immediate mode is not yet merged in the example driver. See https://github.com/kubernetes-sigs/dra-example-driver/pull/4

In any case, I got confused because (as Evan said) you opened the issue against this repo rather than the example driver repo, so I assumed you were using the NVIDIA DRA driver rather than the example one.

asm582 commented 11 months ago

Sorry for the confusion. The reason I raised the issue here is that I saw this log line:

https://github.com/NVIDIA/k8s-dra-driver/blob/4fda7feab5afe75a8a0f8432e92549ca7852572d/cmd/nvidia-dra-controller/driver.go#L111

If we think Immediate mode works, I can certainly move the issue to the appropriate repository. Thanks.

asm582 commented 11 months ago

Hello, we tried this on real nodes and got the status below when exercising claims in Immediate mode:

[root@nvd-srv-02 k8s-dra-driver]# kubectl describe resourceclaim gpu.nvidia.com -n gpu-test1
Name:         gpu.nvidia.com
Namespace:    gpu-test1
Labels:       <none>
Annotations:  <none>
API Version:  resource.k8s.io/v1alpha2
Kind:         ResourceClaim
Metadata:
  Creation Timestamp:  2023-11-29T17:54:02Z
  Finalizers:
    gpu.resource.nvidia.com/deletion-protection
  Resource Version:  7898
  UID:               066b4c8f-a174-45eb-a1b7-9b4ad78a0f17
Spec:
  Allocation Mode:      Immediate
  Resource Class Name:  gpu.nvidia.com
Status:
Events:
  Type     Reason  Age                 From                                     Message
  ----     ------  ----                ----                                     -------
  Warning  Failed  21s (x14 over 62s)  resource driver gpu.resource.nvidia.com  allocate: TODO: immediate allocations not yet supported

Could you please share what we are missing?
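
For reference, the claim described above would correspond to a manifest roughly like the following. This is a reconstruction from the `kubectl describe` output, not the exact file that was applied; the field names follow the `resource.k8s.io/v1alpha2` ResourceClaim schema.

```yaml
# Reconstructed from the describe output above; the original manifest is not shown in this thread.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  namespace: gpu-test1
  name: gpu.nvidia.com
spec:
  resourceClassName: gpu.nvidia.com
  # Ask for allocation as soon as the claim is created,
  # without waiting for a pod that consumes it.
  allocationMode: Immediate
```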

klueska commented 11 months ago

You aren't missing anything:

allocate: TODO: immediate allocations not yet supported

We haven't added support for Immediate mode yet.
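
Until then, one possible alternative is the other `v1alpha2` allocation mode, `WaitForFirstConsumer` (the API default), which defers allocation until a pod that uses the claim is scheduled. The sketch below assumes the driver handles that mode, which this thread does not explicitly confirm; the template name `gpu-template`, the pod name `gpu-pod`, and the container image are hypothetical placeholders.

```yaml
# Sketch only: names marked "hypothetical" are not taken from this thread.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: gpu-template            # hypothetical name
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    # Default mode: allocate when the first consuming pod is scheduled.
    allocationMode: WaitForFirstConsumer
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: gpu-pod                 # hypothetical name
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04         # hypothetical image
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: gpu               # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-template
```

With this mode the claim is generated per pod from the template, so no allocation happens until the pod exists.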

asm582 commented 10 months ago

Thanks. Do we know when Immediate mode will be supported in NVIDIA's DRA driver implementation?

asm582 commented 10 months ago

Ping! Can we request a roadmap of the features planned for NVIDIA's DRA implementation? For our use case, we see Immediate allocation mode as an important feature.

klueska commented 10 months ago

There is no concrete roadmap at the moment. Rapid development on this driver has been paused due to the issues that have come up with getting DRA promoted to beta upstream. All efforts have been shifted to ensuring this happens in as timely a manner as possible. We will, of course, continue to develop this driver, but it is more important to ensure that DRA happens at all, than to keep adding features here.