NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Address issues with running DRA driver on GKE #23

Closed by elezar 9 months ago

elezar commented 9 months ago

These changes allow the DRA driver to run on GKE (which supports enabling alpha features).

They include:

Running the included demo with the image ghcr.io/nvidia/k8s-dra-driver:e4a95c14-ubuntu20.04 produces the following allocation state:

kubectl get nodeallocationstates.nas.gpu.resource.nvidia.com -A -o=json \
    | jq -r '.items[]
        | select(.spec.allocatedClaims)
        | "\(.metadata.name):",
          (.spec.allocatedClaims[])'

gke-k8s-dra-driver-cluster-pool-1-82ea8e4c-02x4:
{
  "claimInfo": {
    "name": "inference-pod-gpu",
    "namespace": "kubecon-demo",
    "uid": "39f74f77-98c4-44bc-9c46-125e5700a60a"
  },
  "gpu": {
    "devices": [
      {
        "uuid": "GPU-679f75dd-b95c-ef89-d691-f3c4f523d43b"
      }
    ]
  }
}
gke-k8s-dra-driver-cluster-pool-2-2afcfde8-86sd:
{
  "claimInfo": {
    "name": "training-pod-gpu",
    "namespace": "kubecon-demo",
    "uid": "5ca7c648-ec6a-4228-ac05-6af75dc36d4c"
  },
  "gpu": {
    "devices": [
      {
        "uuid": "GPU-66e80786-6056-af90-1aaa-10aebff0155a"
      }
    ]
  }
}

and

➜  k8s-dra-driver git:(dra-on-gke) ✗ kubectl logs -n kubecon-demo inference-pod
GPU 0: Tesla T4 (UUID: GPU-679f75dd-b95c-ef89-d691-f3c4f523d43b)
➜  k8s-dra-driver git:(dra-on-gke) ✗ kubectl logs -n kubecon-demo training-pod
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-66e80786-6056-af90-1aaa-10aebff0155a)
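
The two outputs agree: the UUID each pod reports in its logs is the UUID recorded in the allocated claim for that pod (`inference-pod-gpu` on the Tesla T4 node, `training-pod-gpu` on the Tesla V100 node). As a minimal sketch of checking this mechanically, the snippet below pulls the UUID out of a log line in the format shown above and compares it with the allocated UUID. The helper names `extract_uuid` and `uuid_matches` are hypothetical and not part of the driver; the sample values are copied from the output in this comment.

```shell
# Hypothetical helper: print the GPU UUID found in a log line such as
# "GPU 0: Tesla T4 (UUID: GPU-679f75dd-...)". Prints nothing if no UUID is found.
extract_uuid() {
  printf '%s\n' "$1" | sed -n 's/.*(UUID: \(GPU-[0-9a-f-]*\)).*/\1/p'
}

# Hypothetical helper: succeed only if the log line reports the expected UUID.
# Usage: uuid_matches LOG_LINE EXPECTED_UUID
uuid_matches() {
  [ "$(extract_uuid "$1")" = "$2" ]
}

# Sample values copied from the inference-pod output above.
log_line='GPU 0: Tesla T4 (UUID: GPU-679f75dd-b95c-ef89-d691-f3c4f523d43b)'
allocated='GPU-679f75dd-b95c-ef89-d691-f3c4f523d43b'

if uuid_matches "$log_line" "$allocated"; then
  echo "match"
else
  echo "mismatch"
fi
```

Against a live cluster, `log_line` could instead be taken from `kubectl logs -n kubecon-demo inference-pod`, and `allocated` from the `.gpu.devices[].uuid` field of the corresponding entry in the jq output, assuming the pod logs a single GPU line as it does here.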