Closed anencore94 closed 5 months ago
We will get there eventually -- the current focus is on stabilizing the API for DRA in upstream Kubernetes so that we can eventually build a production-ready version of this driver. Until we have that, everything here is just a POC.
That said, I'm happy to help debug your setup in the meantime and/or accept a PR that improves the (intermediate) documentation.
The mig-parted command you show should be sufficient to get the GPUs into a state that the DRA driver can work with. What problems are you running into exactly?
@klueska Hi, Thanks for the reply.
My situation is as follows:
echo $? = 0, but it doesn't work.
After kubectl apply -f gpu-test4.yaml, all pods and resourceclaims stay in Pending status, and nvidia-smi shows no MIG instances on the worker node.
I0503 06:20:08.644542 1 controller.go:383] "recheck periodically" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-pr9h2"
I0503 06:20:09.235388 1 controller.go:373] "processing" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-4gc5w"
I0503 06:20:09.237256 1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/gpu-test4/pods/pod-6668856986-4gc5w 200 OK in 1 milliseconds
I0503 06:20:09.239376 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/gpuclaimparameters/mig-enabled-gpu 200 OK in 1 milliseconds
I0503 06:20:09.240835 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-1g.5gb 200 OK in 1 milliseconds
I0503 06:20:09.242201 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-1g.5gb 200 OK in 1 milliseconds
I0503 06:20:09.243472 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-2g.10gb 200 OK in 1 milliseconds
I0503 06:20:09.244734 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test4/migdeviceclaimparameters/mig-3g.20gb 200 OK in 1 milliseconds
I0503 06:20:09.246121 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/nas.gpu.resource.nvidia.com/v1alpha1/namespaces/nvidia-dra-driver/nodeallocationstates/k8s-dra-driver-cluster-worker 200 OK in 1 milliseconds
I0503 06:20:09.246356 1 controller.go:788] "pending pod claims" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-4gc5w" claims=[{"PodClaimName":"mig-enabled-gpu","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-enabled-gpu-wrvng","generateName":"pod-6668856986-4gc5w-mig-enabled-gpu-","namespace":"gpu-test4","uid":"9e8ff80f-77c6-48a8-aa0f-a381565e3957","resourceVersion":"150555","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-enabled-gpu"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"GpuClaimParameters","name":"mig-enabled-gpu"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-names
pace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"count":1,"selector":{"migEnabled":true}},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-1g-0","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-1g-0-rxnm6","generateName":"pod-6668856986-4gc5w-mig-1g-0-","namespace":"gpu-test4","uid":"26cd27c0-e77d-4538-9478-ca2b135995dd","resourceVersion":"150561","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-1g-0"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-1g.5gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:dr
iverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"1g.5gb","gpuClaimName":"mig-enabled-gpu"},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-1g-1","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-1g-1-wd2xl","generateName":"pod-6668856986-4gc5w-mig-1g-1-","namespace":"gpu-test4","uid":"2ab30f3e-db73-4443-90a2-aa86a53f0a8f","resourceVersion":"150566","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-1g-1"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-1g.5gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operati
on":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"1g.5gb","gpuClaimName":"mig-enabled-gpu"},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-2g","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-2g-45rpf","generateName":"pod-6668856986-4gc5w-mig-2g-","namespace":"gpu-test4","uid":"093a2591-040e-47ab-be4b-cf28947deb67","resourceVersion":"150570","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-2g"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-2g.10gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creationTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/rel
ease-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"2g.10gb","gpuClaimName":"mig-enabled-gpu"},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null},{"PodClaimName":"mig-3g","Claim":{"metadata":{"name":"pod-6668856986-4gc5w-mig-3g-9mgnl","generateName":"pod-6668856986-4gc5w-mig-3g-","namespace":"gpu-test4","uid":"af9c1d79-8378-4fae-943b-e7a101286dfe","resourceVersion":"150574","creationTimestamp":"2024-05-03T06:19:34Z","annotations":{"resource.kubernetes.io/pod-claim-name":"mig-3g"},"ownerReferences":[{"apiVersion":"v1","kind":"Pod","name":"pod-6668856986-4gc5w","uid":"9e728c75-d8a6-4dfc-879a-d211b137399b","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-03T06:19:34Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:resource.kubernetes.io/pod-claim-name":{}},"f:generateName":{},"f:ownerReferences":{".":{},"k:{\"uid\":\"9e728c75-d8a6-4dfc-879a-d211b137399b\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}}}]},"spec":{"resourceClassName":"gpu.nvidia.com","parametersRef":{"apiGroup":"gpu.resource.nvidia.com","kind":"MigDeviceClaimParameters","name":"mig-3g.20gb"},"allocationMode":"WaitForFirstConsumer"},"status":{}},"Class":{"metadata":{"name":"gpu.nvidia.com","uid":"bc979b32-d295-4fea-9570-bd91fb1c9aac","resourceVersion":"712","creat
ionTimestamp":"2024-05-02T01:16:07Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"nvidia","meta.helm.sh/release-namespace":"nvidia-dra-driver"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"resource.k8s.io/v1alpha2","time":"2024-05-02T01:16:07Z","fieldsType":"FieldsV1","fieldsV1":{"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}}}]},"driverName":"gpu.resource.nvidia.com"},"ClaimParameters":{"profile":"3g.20gb","gpuClaimName":"mig-enabled-gpu"},"ClassParameters":{"sharable":true},"UnsuitableNodes":["k8s-dra-driver-cluster-worker"],"Allocation":null,"Error":null}] selectedNode=""
I0503 06:20:09.246386 1 controller.go:383] "recheck periodically" logger="resource controller" key="schedulingCtx:gpu-test4/pod-6668856986-4gc5w"
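The "pending pod claims" entry above is long, but the field that matters for claims stuck in Pending is UnsuitableNodes. A minimal sketch of how one might pull that field out of saved controller logs; the function name unsuitable_nodes and the file name controller.log are hypothetical, not part of the driver.

```shell
# Hypothetical helper: read controller log text on stdin and print each
# distinct UnsuitableNodes list found in it.
unsuitable_nodes() {
  grep -o '"UnsuitableNodes":\[[^]]*\]' | sort -u
}

# Example (controller.log is an assumed file holding the controller's output):
#   unsuitable_nodes < controller.log
```

In the log above, every claim lists k8s-dra-driver-cluster-worker as unsuitable and selectedNode is empty, so the controller found no node able to satisfy the claims.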
I believe the 1g.5gb profile is not available on the RTX (they have profiles of different sizes), which one can query with:
nvidia-smi mig -lgip
The mig-parted command should probably not return success if an invalid profile is passed, but currently it just skips any GPUs where the profile is unavailable. So if none of the GPUs have it available, it still returns success, indicating that all GPUs supporting that profile (in this case, 0) were updated successfully.
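Because a zero exit code from mig-parted does not prove that any GPU was reconfigured, it can help to check first that the profile is listed at all. A rough sketch, assuming the profile name appears verbatim in the output of nvidia-smi mig -lgip; the helper name profile_listed is hypothetical.

```shell
# Hypothetical helper: succeeds only if the given MIG profile name appears
# in the text on stdin (intended to be the output of `nvidia-smi mig -lgip`).
# This is a plain substring match, so "1g.5gb" also matches variants such
# as "1g.5gb+me".
profile_listed() {
  grep -qF -- "$1"
}

# On a GPU node (not run here):
#   nvidia-smi mig -lgip | profile_listed "1g.5gb" \
#     || echo "profile 1g.5gb is not offered by any GPU on this node" >&2
```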
All of that said -- for DRA you should only apply the all-enabled config and nothing else. MIG device creation is done dynamically as requests come in, rather than being preconfigured a priori.
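For reference, applying only the all-enabled config might look like the sketch below. The YAML layout is assumed to follow mig-parted's example config, and the temporary file is purely illustrative; verify both against the mig-parted version you have installed.

```shell
# Write a config containing only the all-enabled entry (layout assumed to
# follow mig-parted's example config; no MIG devices are pre-created).
config_file="$(mktemp)"
cat > "$config_file" <<'EOF'
version: v1
mig-configs:
  all-enabled:
    - devices: all
      mig-enabled: true
      mig-devices: {}
EOF

# Apply it only where the tool is installed (typically on the GPU node itself).
if command -v nvidia-mig-parted >/dev/null 2>&1; then
  nvidia-mig-parted apply -f "$config_file" -c all-enabled
fi
```

Depending on the node setup, the apply step may need root privileges. The empty mig-devices map reflects the point above: the DRA driver creates MIG devices on demand.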
Thanks for the fast reply 🙏
However, I believe the all-enabled config is not applied either, even though it has the value mig-enabled: true, as follows:
Are your GPUs MIG-capable? What does the output of nvidia-smi mig -lgip show in terms of available MIG profiles?
Thanks for checking. As you can see in the above screenshots, there weren't any MIG-enabled devices, and nvidia-smi -mig 1 doesn't work on my VMs. So maybe the GPUs in my VMs are not MIG-capable.
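One quick way to check MIG capability per GPU is to query the current MIG mode. A sketch, assuming nvidia-smi reports Enabled, Disabled, or a placeholder such as [N/A] for GPUs without MIG support; the helper name classify_mig_mode is hypothetical, and output inside a VM may differ.

```shell
# Hypothetical helper: read one MIG mode value per line on stdin and
# report whether each GPU looks MIG-capable.
classify_mig_mode() {
  local idx=0 mode
  while IFS= read -r mode; do
    case "$mode" in
      Enabled)  echo "GPU $idx: MIG enabled" ;;
      Disabled) echo "GPU $idx: MIG-capable, currently disabled" ;;
      *)        echo "GPU $idx: MIG not supported or not reported ($mode)" ;;
    esac
    idx=$((idx + 1))
  done
}

# On a GPU node (not run here):
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | classify_mig_mode
```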
I'll check it out. Thanks again.
Issue Description
The quickstart/README.md in the demo/specs/quickstart directory currently includes brief comments on setting up for a "real" deployment but lacks comprehensive details necessary for effective use, particularly regarding Multi-Instance GPU (MIG) configuration.
Affected Files
demo/specs/quickstart/README.md
Specific Problems
The examples in demo/specs/quickstart (e.g., gpu-test4, gpu-test5, gpu-test6) require MIG to be enabled. However, there are no detailed instructions on how to configure the MIG-related options properly. Consequently, without the correct setup, deploying these examples results in failures.
Suggested Enhancements
Expanded README with MIG Setup Instructions: It would be highly beneficial for users if the README includes a detailed section on setting up MIG in a kind cluster environment. This should cover:
nvidia-mig-parted or similar tools.
Just applying the following doesn't work for my cluster