NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Immediate mode allocation failing #47

Closed by asm582 7 months ago

asm582 commented 7 months ago

Hello,

I have taken a stab at implementing immediate mode, but I get the message below on an A30 host:

[root@e28-h13-r750 quickstart]# kubectl describe resourceclaim immediate-claim -n gpu-test-immediate
Name:         immediate-claim
Namespace:    gpu-test-immediate
Labels:       <none>
Annotations:  <none>
API Version:  resource.k8s.io/v1alpha2
Kind:         ResourceClaim
Metadata:
  Creation Timestamp:  2024-01-03T16:52:01Z
  Finalizers:
    gpu.resource.nvidia.com/deletion-protection
  Resource Version:  882
  UID:               be93055c-5826-40bd-abfa-2411eb4cb0a3
Spec:
  Allocation Mode:      Immediate
  Resource Class Name:  gpu.nvidia.com
Status:
Events:
  Type     Reason  Age               From                                     Message
  ----     ------  ----              ----                                     -------
  Warning  Failed  4s (x11 over 9s)  resource driver gpu.resource.nvidia.com  allocate: error performing immediate allocation: updating NodeAllocationState CRD: NodeAllocationState.nas.gpu.resource.nvidia.com "k8s-dra-driver-cluster-worker" is invalid: spec.allocatedClaims.be93055c-5826-40bd-abfa-2411eb4cb0a3.claimInfo: Required value
[root@e28-h13-r750 quickstart]# kubectl describe nas k8s-dra-driver-cluster-worker -n nvidia-dra-driver
Name:         k8s-dra-driver-cluster-worker
Namespace:    nvidia-dra-driver
Labels:       <none>
Annotations:  <none>
API Version:  nas.gpu.resource.nvidia.com/v1alpha1
Kind:         NodeAllocationState
Metadata:
  Creation Timestamp:  2024-01-03T16:51:09Z
  Generation:          4
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            k8s-dra-driver-cluster-worker
    UID:             1cedb2bf-c49d-49b7-aa4d-935df065ef5e
  Resource Version:  810
  UID:               969d2469-db63-4f73-aa96-7fabadafb404
Spec:
  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    0
      Memory Bytes:             25769803776
      Mig Enabled:              false
      Product Name:             NVIDIA A30
      Uuid:                     GPU-6916654b-79f0-c527-a2c4-f977dab85b91
Status:                         Ready
Events:                         <none>

Any pointers on what the issue could be?

asm582 commented 7 months ago

I was trying to set the node name to allocate the claim using a client that was not the one for NVIDIA GPUs, which caused the issue above. Closing this issue for now.
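
For anyone hitting the same validation error, a rough pre-flight check along these lines might help catch it before the update is sent. This is only a sketch: the function name `claims_missing_claim_info` is hypothetical, the spec is treated as a plain dict, and the `claimInfo` and `gpu` payload shapes are assumptions. Only the `spec.allocatedClaims.<uid>.claimInfo` path comes from the error message above.

```python
from typing import Any, Dict, List


def claims_missing_claim_info(nas_spec: Dict[str, Any]) -> List[str]:
    """Return the UIDs of allocatedClaims entries with no (or empty) claimInfo.

    Hypothetical helper: it mirrors the "claimInfo: Required value" check
    seen in the error message, not the driver's actual validation code.
    """
    allocated = nas_spec.get("allocatedClaims") or {}
    return sorted(
        uid
        for uid, entry in allocated.items()
        if not (entry or {}).get("claimInfo")
    )


CLAIM_UID = "be93055c-5826-40bd-abfa-2411eb4cb0a3"

# Assumed shape of the rejected update: the entry exists but has no claimInfo.
bad_spec = {"allocatedClaims": {CLAIM_UID: {"gpu": {"devices": []}}}}

# Assumed shape of an accepted entry; the claimInfo field names are a guess.
good_spec = {
    "allocatedClaims": {
        CLAIM_UID: {
            "claimInfo": {
                "namespace": "gpu-test-immediate",
                "name": "immediate-claim",
                "uid": CLAIM_UID,
            },
            "gpu": {"devices": []},
        }
    }
}

if __name__ == "__main__":
    print("missing claimInfo:", claims_missing_claim_info(bad_spec))
    print("missing claimInfo:", claims_missing_claim_info(good_spec))
```

If the first call reports a UID, the client writing the NodeAllocationState is probably not populating `claimInfo`, which is consistent with what happened here when the wrong client was used.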