NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Lack of Detailed Documentation on MIG Configuration in Quickstart Guide #105

Closed · anencore94 closed this issue 5 months ago

anencore94 commented 5 months ago

Issue Description

The README.md in the demo/specs/quickstart directory currently includes brief comments on setting up for a "real" deployment, but it lacks the comprehensive details necessary for effective use, particularly regarding Multi-Instance GPU (MIG) configuration.

Affected Files

  • demo/specs/quickstart/README.md

Specific Problems

  1. Insufficient Details for MIG Setup in Kind Cluster: The examples under demo/specs/quickstart (e.g., gpu-test4, gpu-test5, gpu-test6) require MIG to be enabled. However, there are no detailed instructions on how to configure the MIG-related options properly. Consequently, without the correct setup, deploying these examples fails.

Suggested Enhancements

  1. Expanded README with MIG Setup Instructions: It would be highly beneficial for users if the README included a detailed section on setting up MIG in a kind cluster environment. This should cover the following points (see the rough sketch after this list):

    • Prerequisites for enabling MIG, including required tools and initial environment setup.
    • Step-by-step guide on configuring MIG using nvidia-mig-parted or similar tools.
    • Examples of commands and configurations that have been proven to work in similar setups.
    • Troubleshooting common issues that might arise during the setup.
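
For illustration, the prerequisite checks such a section might open with could be as simple as the following (a sketch, assuming the NVIDIA driver, nvidia-smi, and nvidia-mig-parted are already installed on the host; these commands are not taken from the current README):

# Confirm the NVIDIA driver works and list the GPUs on the node
nvidia-smi -L

# Confirm nvidia-mig-parted is installed and on PATH
command -v nvidia-mig-parted

# Check which GPUs support MIG; non-MIG-capable GPUs report [N/A]
# in the mig.mode.current column
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv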

Just applying the following doesn't work for my cluster:


cat <<EOF | sudo -E nvidia-mig-parted apply -f -
version: v1
mig-configs:
  half-half:
    - devices: [0,1,2,3]
      mig-enabled: false
    - devices: [4,5,6,7]
      mig-enabled: true
      mig-devices: {}
EOF
klueska commented 5 months ago

We will get there eventually -- the current focus is on stabilizing the API for DRA in upstream Kubernetes so that we can eventually build a production ready version of this driver. Until we have that, everything here is just a POC.

That said, I'm happy to help debug your setup in the meantime and / or accept a PR that improves the (intermediate) documentation.

The mig-parted command you show should be sufficient to get the GPUs in a state that the DRA driver can work with. What problems are you running into exactly?

anencore94 commented 5 months ago

@klueska Hi, Thanks for the reply.

[screenshot]

My situation is as follows:

  1. The kind cluster's worker node has two different GPU models.
  2. I applied nvidia-mig-parted with the config file from https://github.com/NVIDIA/mig-parted/blob/main/examples/config.yaml
    1. The command succeeded (echo $? is 0), but the configuration is not actually applied.
  3. Therefore, when I run kubectl apply -f gpu-test4.yaml, all pods and resourceclaims stay in Pending status, and nvidia-smi on the worker node shows no MIG instances.

[screenshot]

  4. The logs of the dra-controller are as follows:
I0503 06:20:08.644542       1 controller.go:383] "recheck periodically" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-pr9h2"
I0503 06:20:09.235388       1 controller.go:373] "processing" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-4gc5w"
I0503 06:20:09.237256       1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/gpu-test4/pods/pod-6668856986-4gc5w 200 OK in 1 milliseconds
I0503 06:20:09.239376       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/gpuclaimparameters/mig-enabled-gpu 200 OK in 1 milliseconds
I0503 06:20:09.240835       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-1g.5gb 200 OK in 1 milliseconds
I0503 06:20:09.242201       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-1g.5gb 200 OK in 1 milliseconds
I0503 06:20:09.243472       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-2g.10gb 200 OK in 1 milliseconds
I0503 06:20:09.244734       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-3g.20gb 200 OK in 1 milliseconds
I0503 06:20:09.246121       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/nas.gpu.resource.nvidia.com/v1alpha1/namespaces/nvidia-dra-driver/nodeallocationstates/k8s-dra-driver-cluster-worker 200 OK in 1 milliseconds
I0503 06:20:09.246356       1 controller.go:788] "pending pod claims" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-4gc5w" claims=[{"PodClaimName":"mig-enabled-gpu","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-enabled-gpu-wrvng","generateName":"pod-6668856986-4gc5w-mig-enabled-gpu-","namespace":"gpu-test4","uid":"9e8ff80f-77c6-48a8-aa0f-a381565e3957","resourceVersion":"150555","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-enabled-gpu"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"GpuClaimParameters","name":"mig-enabled-gpu"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"count":1,"selector":{"migEnabled":true}},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-1g-0","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-1g-0-rxnm6","generateName":"pod-6668856986-4gc5w-mig-1g-0-","namespace":"gpu-test4","uid":"26cd27c0-e77d-4538-9478-ca2b135995dd","resourceVersion":"150561","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-1g-0"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-1g.5gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":
"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"1g.5gb","gpuClaimName":"mig-enabled-gpu"},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-1g-1","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-1g-1-wd2xl","generateName":"pod-6668856986-4gc5w-mig-1g-1-","namespace":"gpu-test4","uid":"2ab30f3e-db73-4443-90a2-aa86a53f0a8f","resourceVersion":"150566","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-1g-1"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-1g.5gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"1g.5gb","gpuClaimName":"mig-enabled-gpu"},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-2g","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-2g-45rpf","generateName":"pod-6668856986-4gc5w-mig-2g-","namespace":"gpu-test4","uid":"093a2591-040e-47ab-be4b-cf28947deb67","resourceVersion":"150570","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-2g"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager"
,"operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-2g.10gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"2g.10gb","gpuClaimName":"mig-enabled-gpu"},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-3g","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-3g-9mgnl","generateName":"pod-6668856986-4gc5w-mig-3g-","namespace":"gpu-test4","uid":"af9c1d79-8378-4fae-943b-e7a101286dfe","resourceVersion":"150574","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-3g"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-3g.20gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"3g.20gb","gpuClaimName":"mig-enabled-gpu"},"Cl
assParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null}] selectedNode=""
I0503 06:20:09.246386       1 controller.go:383] "recheck periodically" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-4gc5w"
klueska commented 5 months ago

I believe the 1g.5gb profile is not available on the RTX (those GPUs have profiles of different sizes), which you can query with:

nvidia-smi mig -lgip
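
As an aside, a small guard one could script around that command (a sketch; the 1g.5gb string is just the profile from this thread, and note that profiles are typically listed only for GPUs that already have MIG mode enabled):

# Fail fast if no GPU on this node exposes the requested profile
if ! nvidia-smi mig -lgip | grep -q "1g.5gb"; then
  echo "no GPU on this node exposes the 1g.5gb profile" >&2
  exit 1
fi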

The mig-parted command should probably not return success when an invalid profile is passed, but currently it just skips any GPUs where the profile is unavailable. So if none of the GPUs have the profile available, it still returns success, indicating that all GPUs supporting that profile (in this case, zero of them) were updated successfully.

All of that said -- for DRA you should only apply the all-enabled config and nothing else. MIG device creation is done dynamically as requests come in, rather than being preconfigured a priori.
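
For reference, a minimal sketch of doing only that, inlining the all-enabled entry from the mig-parted examples/config.yaml linked above via the same heredoc pattern used earlier in this thread (mig-devices is left empty because the DRA driver carves up MIG devices on demand):

cat <<EOF | sudo -E nvidia-mig-parted apply -f -
version: v1
mig-configs:
  all-enabled:
    # Enable MIG mode on every GPU, but do not pre-create any MIG devices
    - devices: all
      mig-enabled: true
      mig-devices: {}
EOF

Afterwards, plain nvidia-smi should show MIG mode as Enabled on the MIG-capable GPUs.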

anencore94 commented 5 months ago

Thanks for the fast reply 🙏 However, I believe the all-enabled config is not applied either. [screenshot]

This is the case even though the all-enabled config sets mig-enabled: true, as shown here: [screenshot]

klueska commented 5 months ago

Are your GPUs mig capable? What does the output of nvidia-smi mig -lgip show in terms of available MIG profiles?

anencore94 commented 5 months ago

> Are your GPUs mig capable? What does the output of nvidia-smi mig -lgip show in terms of available MIG profiles?

Thanks for the check. As you can see in the above screenshots, there weren't any MIG-enabled devices, and nvidia-smi -mig 1 doesn't work on my VMs. So maybe the GPUs in my VMs are not MIG-capable. I'll check it out. Thanks again.
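
A quick check for that (a sketch): the mig.mode.current and mig.mode.pending fields distinguish a GPU with no MIG support at all ([N/A] in both columns) from one where MIG mode was enabled but a GPU reset is still pending (the Pending value differs from the Current one):

nvidia-smi --query-gpu=index,name,mig.mode.current,mig.mode.pending --format=csv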