Closed asm582 closed 9 months ago
As you can see from the list of profiles that are available, `1g.5gb` is not a valid profile name. Since you are using an 80GB card, the equivalent profile (from a compute perspective) is `1g.10gb`.
Thanks, I changed the sample to be as below, but still no luck. Any pointers?
```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1
---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: gpu-test1
  name: mig-enabled-gpu
spec:
  count: 1
  selector:
    migEnabled: true
---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  namespace: gpu-test1
  name: mig-1g.10gb
spec:
  profile: "1g.10gb"
  gpuClaimName: "mig-enabled-gpu"
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: mig-1g.10gb
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: MigDeviceClaimParameters
      name: mig-1g.10gb
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod1
  labels:
    app: pod
spec:
  resourceClaims:
  - name: mig1g
    source:
      resourceClaimTemplateName: mig-1g.10gb
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; sleep 9999"]
    resources:
      claims:
      - name: mig1g
```
```console
[root@nvd-srv-02 k8s-dra-driver]# kubectl describe resourceclaim pod1-mig1g -n gpu-test1
Name:         pod1-mig1g
Namespace:    gpu-test1
Labels:       <none>
Annotations:  <none>
API Version:  resource.k8s.io/v1alpha2
Kind:         ResourceClaim
Metadata:
  Creation Timestamp:  2023-12-04T14:34:36Z
  Owner References:
    API Version:           v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Pod
    Name:                  pod1
    UID:                   2c951092-40e6-4df7-994d-e8af41244e9a
  Resource Version:  1057
  UID:               fb5c03ed-8899-4919-bb01-6dcaf05d4c8f
Spec:
  Allocation Mode:  WaitForFirstConsumer
  Parameters Ref:
    API Group:  gpu.resource.nvidia.com
    Kind:       MigDeviceClaimParameters
    Name:       mig-1g.10gb
  Resource Class Name:  gpu.nvidia.com
Status:
Events:  <none>
[root@nvd-srv-02 k8s-dra-driver]# kubectl apply -f /root/k8s-dra-driver/demo/specs/quickstart/gpu-test1-1sliceperpod.yaml
namespace/gpu-test1 created
gpuclaimparameters.gpu.resource.nvidia.com/mig-enabled-gpu created
migdeviceclaimparameters.gpu.resource.nvidia.com/mig-1g.10gb created
resourceclaimtemplate.resource.k8s.io/mig-1g.10gb created
pod/pod1 created
[root@nvd-srv-02 k8s-dra-driver]# kubectl get pod -A
NAMESPACE            NAME                                                           READY   STATUS    RESTARTS   AGE
gpu-test1            pod1                                                           0/1     Pending   0          7s
kube-system          coredns-5d78c9869d-fv4gx                                       1/1     Running   0          6m2s
kube-system          coredns-5d78c9869d-w72hg                                       1/1     Running   0          6m2s
kube-system          etcd-k8s-dra-driver-cluster-control-plane                      1/1     Running   0          6m19s
kube-system          kindnet-hz962                                                  1/1     Running   0          6m3s
kube-system          kindnet-jw2nt                                                  1/1     Running   0          5m58s
kube-system          kube-apiserver-k8s-dra-driver-cluster-control-plane            1/1     Running   0          6m21s
kube-system          kube-controller-manager-k8s-dra-driver-cluster-control-plane   1/1     Running   0          6m19s
kube-system          kube-proxy-b6q2k                                               1/1     Running   0          5m58s
kube-system          kube-proxy-f2xll                                               1/1     Running   0          6m3s
kube-system          kube-scheduler-k8s-dra-driver-cluster-control-plane            1/1     Running   0          6m19s
local-path-storage   local-path-provisioner-6bc4bddd6b-z5hsk                        1/1     Running   0          6m2s
nvidia-dra-driver    nvidia-k8s-dra-driver-controller-6d6b45756-gswbf               1/1     Running   0          5m37s
nvidia-dra-driver    nvidia-k8s-dra-driver-kubelet-plugin-9tkgw                     1/1     Running   0          5m37s
```
Controller logs:
[root@nvd-srv-02 k8s-dra-driver]# kubectl logs nvidia-k8s-dra-driver-controller-6d6b45756-gswbf -n nvidia-dra-driver
I1204 14:29:07.307943 1 controller.go:295] "resource controller: Starting" driver="gpu.resource.nvidia.com"
I1204 14:29:07.308034 1 reflector.go:287] Starting reflector *v1alpha2.ResourceClaim (0s) from k8s.io/client-go/informers/factory.go:150
I1204 14:29:07.308707 1 reflector.go:287] Starting reflector *v1alpha2.PodSchedulingContext (0s) from k8s.io/client-go/informers/factory.go:150
I1204 14:29:07.308725 1 reflector.go:323] Listing and watching *v1alpha2.PodSchedulingContext from k8s.io/client-go/informers/factory.go:150
I1204 14:29:07.308716 1 reflector.go:287] Starting reflector *v1alpha2.ResourceClass (0s) from k8s.io/client-go/informers/factory.go:150
I1204 14:29:07.308760 1 reflector.go:323] Listing and watching *v1alpha2.ResourceClass from k8s.io/client-go/informers/factory.go:150
I1204 14:29:07.308810 1 reflector.go:323] Listing and watching *v1alpha2.ResourceClaim from k8s.io/client-go/informers/factory.go:150
I1204 14:29:07.316094 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?limit=500&resourceVersion=0 200 OK in 6 milliseconds
I1204 14:29:07.316096 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?limit=500&resourceVersion=0 200 OK in 6 milliseconds
I1204 14:29:07.316094 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?limit=500&resourceVersion=0 200 OK in 6 milliseconds
I1204 14:29:07.317044 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?allowWatchBookmarks=true&resourceVersion=533&timeout=7m41s&timeoutSeconds=461&watch=true 200 OK in 0 milliseconds
I1204 14:29:07.317092 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?allowWatchBookmarks=true&resourceVersion=512&timeout=6m21s&timeoutSeconds=381&watch=true 200 OK in 0 milliseconds
I1204 14:29:07.317173 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?allowWatchBookmarks=true&resourceVersion=546&timeout=5m39s&timeoutSeconds=339&watch=true 200 OK in 0 milliseconds
I1204 14:29:07.408399 1 shared_informer.go:344] caches populated
I1204 14:34:36.564425 1 controller.go:241] "resource controller: new object" type="ResourceClaim" content="{\"metadata\":{\"name\":\"pod1-mig1g\",\"namespace\":\"gpu-test1\",\"uid\":\"fb5c03ed-8899-4919-bb01-6dcaf05d4c8f\",\"resourceVersion\":\"1057\",\"creationTimestamp\":\"2023-12-04T14:34:36Z\",\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"pod1\",\"uid\":\"2c951092-40e6-4df7-994d-e8af41244e9a\",\"controller\":true,\"blockOwnerDeletion\":true}],\"managedFields\":[{\"manager\":\"kube-controller-manager\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2023-12-04T14:34:36Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"2c951092-40e6-4df7-994d-e8af41244e9a\\\"}\":{}}},\"f:spec\":{\"f:allocationMode\":{},\"f:parametersRef\":{\".\":{},\"f:apiGroup\":{},\"f:kind\":{},\"f:name\":{}},\"f:resourceClassName\":{}}}}]},\"spec\":{\"resourceClassName\":\"gpu.nvidia.com\",\"parametersRef\":{\"apiGroup\":\"gpu.resource.nvidia.com\",\"kind\":\"MigDeviceClaimParameters\",\"name\":\"mig-1g.10gb\"},\"allocationMode\":\"WaitForFirstConsumer\"},\"status\":{}}"
I1204 14:34:36.564448 1 controller.go:260] "resource controller: Adding new work item" key="claim:gpu-test1/pod1-mig1g"
I1204 14:34:36.564477 1 controller.go:332] "resource controller: processing" key="claim:gpu-test1/pod1-mig1g"
I1204 14:34:36.564491 1 controller.go:476] "resource controller: ResourceClaim waiting for first consumer" key="claim:gpu-test1/pod1-mig1g"
I1204 14:34:36.564497 1 controller.go:336] "resource controller: completed" key="claim:gpu-test1/pod1-mig1g"
I1204 14:34:38.124563 1 controller.go:241] "resource controller: new object" type="PodSchedulingContext" content="{\"metadata\":{\"name\":\"pod1\",\"namespace\":\"gpu-test1\",\"uid\":\"b64ca68d-ec48-4b40-b773-f66d1a2b7abb\",\"resourceVersion\":\"1059\",\"creationTimestamp\":\"2023-12-04T14:34:38Z\",\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"pod1\",\"uid\":\"2c951092-40e6-4df7-994d-e8af41244e9a\",\"controller\":true}],\"managedFields\":[{\"manager\":\"kube-scheduler\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2023-12-04T14:34:38Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"2c951092-40e6-4df7-994d-e8af41244e9a\\\"}\":{}}},\"f:spec\":{\"f:potentialNodes\":{\".\":{},\"v:\\\"k8s-dra-driver-cluster-worker\\\"\":{}},\"f:selectedNode\":{}}}}]},\"spec\":{\"selectedNode\":\"k8s-dra-driver-cluster-worker\",\"potentialNodes\":[\"k8s-dra-driver-cluster-worker\"]},\"status\":{}}"
I1204 14:34:38.124582 1 controller.go:260] "resource controller: Adding new work item" key="schedulingCtx:gpu-test1/pod1"
I1204 14:34:38.124611 1 controller.go:332] "resource controller: processing" key="schedulingCtx:gpu-test1/pod1"
I1204 14:34:38.126303 1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/gpu-test1/pods/pod1 200 OK in 1 milliseconds
I1204 14:34:38.129192 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test1/migdeviceclaimparameters/mig-1g.10gb 200 OK in 1 milliseconds
I1204 14:34:38.130848 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/nas.gpu.resource.nvidia.com/v1alpha1/namespaces/nvidia-dra-driver/nodeallocationstates/k8s-dra-driver-cluster-worker 200 OK in 1 milliseconds
I1204 14:34:38.131372 1 controller.go:674] "resource controller: pending pod claims" key="schedulingCtx:gpu-test1/pod1" claims=[{PodClaimName:mig1g Claim:&ResourceClaim{ObjectMeta:{pod1-mig1g gpu-test1 fb5c03ed-8899-4919-bb01-6dcaf05d4c8f 1057 0 2023-12-04 14:34:36 +0000 UTC <nil> <nil> map[] map[] [{v1 Pod pod1 2c951092-40e6-4df7-994d-e8af41244e9a 0xc00081a188 0xc00081a189}] [] [{kube-controller-manager Update resource.k8s.io/v1alpha2 2023-12-04 14:34:36 +0000 UTC FieldsV1 {"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"2c951092-40e6-4df7-994d-e8af41244e9a\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}} }]},Spec:ResourceClaimSpec{ResourceClassName:gpu.nvidia.com,ParametersRef:&ResourceClaimParametersReference{APIGroup:gpu.resource.nvidia.com,Kind:MigDeviceClaimParameters,Name:mig-1g.10gb,},AllocationMode:WaitForFirstConsumer,},Status:ResourceClaimStatus{DriverName:,Allocation:nil,ReservedFor:[]ResourceClaimConsumerReference{},DeallocationRequested:false,},} Class:&ResourceClass{ObjectMeta:{gpu.nvidia.com c570a929-e0d7-40ec-8a0d-4d67fddd16d7 546 0 2023-12-04 14:29:06 +0000 UTC <nil> <nil> map[app.kubernetes.io/managed-by:Helm] map[meta.helm.sh/release-name:nvidia meta.helm.sh/release-namespace:nvidia-dra-driver] [] [] [{helm Update resource.k8s.io/v1alpha2 2023-12-04 14:29:06 +0000 UTC FieldsV1 {"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}} }]},DriverName:gpu.resource.nvidia.com,ParametersRef:nil,SuitableNodes:nil,} ClaimParameters:0xc00023e630 ClassParameters:0xc0005de410 UnsuitableNodes:[k8s-dra-driver-cluster-worker]}] selectedNode="k8s-dra-driver-cluster-worker"
I1204 14:34:38.131391 1 controller.go:685] "resource controller: skipping allocation for unsuitable selected node" key="schedulingCtx:gpu-test1/pod1" node="k8s-dra-driver-cluster-worker"
I1204 14:34:38.131441 1 controller.go:724] "resource controller: Updating pod scheduling with modified unsuitable nodes" key="schedulingCtx:gpu-test1/pod1" podSchedulingCtx="&PodSchedulingContext{ObjectMeta:{pod1 gpu-test1 b64ca68d-ec48-4b40-b773-f66d1a2b7abb 1059 0 2023-12-04 14:34:38 +0000 UTC <nil> <nil> map[] map[] [{v1 Pod pod1 2c951092-40e6-4df7-994d-e8af41244e9a 0xc0005aff98 <nil>}] [] [{kube-scheduler Update resource.k8s.io/v1alpha2 2023-12-04 14:34:38 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"2c951092-40e6-4df7-994d-e8af41244e9a\\\"}\":{}}},\"f:spec\":{\"f:potentialNodes\":{\".\":{},\"v:\\\"k8s-dra-driver-cluster-worker\\\"\":{}},\"f:selectedNode\":{}}} }]},Spec:PodSchedulingContextSpec{SelectedNode:k8s-dra-driver-cluster-worker,PotentialNodes:[k8s-dra-driver-cluster-worker],},Status:PodSchedulingContextStatus{ResourceClaims:[]ResourceClaimSchedulingStatus{ResourceClaimSchedulingStatus{Name:mig1g,UnsuitableNodes:[k8s-dra-driver-cluster-worker],},},},}"
I1204 14:34:38.133112 1 round_trippers.go:553] PUT https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/namespaces/gpu-test1/podschedulingcontexts/pod1/status 200 OK in 1 milliseconds
I1204 14:34:38.133282 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:gpu-test1/pod1"
I1204 14:34:38.135512 1 controller.go:249] "resource controller: updated object" type="PodSchedulingContext" content="{\"metadata\":{\"name\":\"pod1\",\"namespace\":\"gpu-test1\",\"uid\":\"b64ca68d-ec48-4b40-b773-f66d1a2b7abb\",\"resourceVersion\":\"1062\",\"creationTimestamp\":\"2023-12-04T14:34:38Z\",\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"pod1\",\"uid\":\"2c951092-40e6-4df7-994d-e8af41244e9a\",\"controller\":true}],\"managedFields\":[{\"manager\":\"kube-scheduler\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2023-12-04T14:34:38Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"2c951092-40e6-4df7-994d-e8af41244e9a\\\"}\":{}}},\"f:spec\":{\"f:potentialNodes\":{\".\":{},\"v:\\\"k8s-dra-driver-cluster-worker\\\"\":{}},\"f:selectedNode\":{}}}},{\"manager\":\"nvidia-dra-controller\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2023-12-04T14:34:38Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:status\":{\"f:resourceClaims\":{\".\":{},\"k:{\\\"name\\\":\\\"mig1g\\\"}\":{\".\":{},\"f:name\":{},\"f:unsuitableNodes\":{\".\":{},\"v:\\\"k8s-dra-driver-cluster-worker\\\"\":{}}}}}},\"subresource\":\"status\"}]},\"spec\":{\"selectedNode\":\"k8s-dra-driver-cluster-worker\",\"potentialNodes\":[\"k8s-dra-driver-cluster-worker\"]},\"status\":{\"resourceClaims\":[{\"name\":\"mig1g\",\"unsuitableNodes\":[\"k8s-dra-driver-cluster-worker\"]}]}}" diff=<
&v1alpha2.PodSchedulingContext{
TypeMeta: {},
ObjectMeta: v1.ObjectMeta{
... // 3 identical fields
SelfLink: "",
UID: "b64ca68d-ec48-4b40-b773-f66d1a2b7abb",
- ResourceVersion: "1059",
+ ResourceVersion: "1062",
Generation: 0,
CreationTimestamp: {Time: s"2023-12-04 14:34:38 +0000 UTC"},
... // 4 identical fields
OwnerReferences: {{APIVersion: "v1", Kind: "Pod", Name: "pod1", UID: "2c951092-40e6-4df7-994d-e8af41244e9a", ...}},
Finalizers: nil,
ManagedFields: []v1.ManagedFieldsEntry{
{Manager: "kube-scheduler", Operation: "Update", APIVersion: "resource.k8s.io/v1alpha2", Time: s"2023-12-04 14:34:38 +0000 UTC", ...},
+ {
+ Manager: "nvidia-dra-controller",
+ Operation: "Update",
+ APIVersion: "resource.k8s.io/v1alpha2",
+ Time: s"2023-12-04 14:34:38 +0000 UTC",
+ FieldsType: "FieldsV1",
+ FieldsV1: s`{"f:status":{"f:resourceClaims":{".":{},"k:{\"name\":\"mig1g\"}":{".":{},"f:name":{},"f:unsuitableNodes":{".":{},"v:\"k8s-dra-dr`...,
+ Subresource: "status",
+ },
},
},
Spec: {SelectedNode: "k8s-dra-driver-cluster-worker", PotentialNodes: {"k8s-dra-driver-cluster-worker"}},
- Status: v1alpha2.PodSchedulingContextStatus{},
+ Status: v1alpha2.PodSchedulingContextStatus{
+ ResourceClaims: []v1alpha2.ResourceClaimSchedulingStatus{{Name: "mig1g", UnsuitableNodes: []string{"k8s-dra-driver-cluster-worker"}}},
+ },
}
>
I1204 14:34:38.135527 1 controller.go:260] "resource controller: Adding updated work item" key="schedulingCtx:gpu-test1/pod1"
I1204 14:34:38.135548 1 controller.go:332] "resource controller: processing" key="schedulingCtx:gpu-test1/pod1"
I1204 14:34:38.136496 1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/gpu-test1/pods/pod1 200 OK in 0 milliseconds
I1204 14:34:38.137831 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test1/migdeviceclaimparameters/mig-1g.10gb 200 OK in 1 milliseconds
I1204 14:34:38.139611 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/nas.gpu.resource.nvidia.com/v1alpha1/namespaces/nvidia-dra-driver/nodeallocationstates/k8s-dra-driver-cluster-worker 200 OK in 1 milliseconds
I1204 14:34:38.139928 1 controller.go:674] "resource controller: pending pod claims" key="schedulingCtx:gpu-test1/pod1" claims=[{PodClaimName:mig1g Claim:&ResourceClaim{ObjectMeta:{pod1-mig1g gpu-test1 fb5c03ed-8899-4919-bb01-6dcaf05d4c8f 1057 0 2023-12-04 14:34:36 +0000 UTC <nil> <nil> map[] map[] [{v1 Pod pod1 2c951092-40e6-4df7-994d-e8af41244e9a 0xc00081a188 0xc00081a189}] [] [{kube-controller-manager Update resource.k8s.io/v1alpha2 2023-12-04 14:34:36 +0000 UTC FieldsV1 {"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"2c951092-40e6-4df7-994d-e8af41244e9a\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}} }]},Spec:ResourceClaimSpec{ResourceClassName:gpu.nvidia.com,ParametersRef:&ResourceClaimParametersReference{APIGroup:gpu.resource.nvidia.com,Kind:MigDeviceClaimParameters,Name:mig-1g.10gb,},AllocationMode:WaitForFirstConsumer,},Status:ResourceClaimStatus{DriverName:,Allocation:nil,ReservedFor:[]ResourceClaimConsumerReference{},DeallocationRequested:false,},} Class:&ResourceClass{ObjectMeta:{gpu.nvidia.com c570a929-e0d7-40ec-8a0d-4d67fddd16d7 546 0 2023-12-04 14:29:06 +0000 UTC <nil> <nil> map[app.kubernetes.io/managed-by:Helm] map[meta.helm.sh/release-name:nvidia meta.helm.sh/release-namespace:nvidia-dra-driver] [] [] [{helm Update resource.k8s.io/v1alpha2 2023-12-04 14:29:06 +0000 UTC FieldsV1 {"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}} }]},DriverName:gpu.resource.nvidia.com,ParametersRef:nil,SuitableNodes:nil,} ClaimParameters:0xc000489500 ClassParameters:0xc0005b2198 UnsuitableNodes:[k8s-dra-driver-cluster-worker]}] selectedNode="k8s-dra-driver-cluster-worker"
I1204 14:34:38.139957 1 controller.go:685] "resource controller: skipping allocation for unsuitable selected node" key="schedulingCtx:gpu-test1/pod1" node="k8s-dra-driver-cluster-worker"
I1204 14:34:38.139973 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:gpu-test1/pod1"
I1204 14:34:46.317623 1 reflector.go:788] k8s.io/client-go/informers/factory.go:150: Watch close - *v1alpha2.ResourceClass total 7 items received
I1204 14:34:46.318175 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?allowWatchBookmarks=true&resourceVersion=1074&timeout=7m55s&timeoutSeconds=475&watch=true 200 OK in 0 milliseconds
I1204 14:35:08.134362 1 controller.go:332] "resource controller: processing" key="schedulingCtx:gpu-test1/pod1"
I1204 14:35:08.136101 1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/gpu-test1/pods/pod1 200 OK in 1 milliseconds
I1204 14:35:08.137793 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/gpu.resource.nvidia.com/v1alpha1/namespaces/gpu-test1/migdeviceclaimparameters/mig-1g.10gb 200 OK in 1 milliseconds
I1204 14:35:08.139435 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/nas.gpu.resource.nvidia.com/v1alpha1/namespaces/nvidia-dra-driver/nodeallocationstates/k8s-dra-driver-cluster-worker 200 OK in 1 milliseconds
I1204 14:35:08.139721 1 controller.go:674] "resource controller: pending pod claims" key="schedulingCtx:gpu-test1/pod1" claims=[{PodClaimName:mig1g Claim:&ResourceClaim{ObjectMeta:{pod1-mig1g gpu-test1 fb5c03ed-8899-4919-bb01-6dcaf05d4c8f 1057 0 2023-12-04 14:34:36 +0000 UTC <nil> <nil> map[] map[] [{v1 Pod pod1 2c951092-40e6-4df7-994d-e8af41244e9a 0xc00081a188 0xc00081a189}] [] [{kube-controller-manager Update resource.k8s.io/v1alpha2 2023-12-04 14:34:36 +0000 UTC FieldsV1 {"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"2c951092-40e6-4df7-994d-e8af41244e9a\"}":{}}},"f:spec":{"f:allocationMode":{},"f:parametersRef":{".":{},"f:apiGroup":{},"f:kind":{},"f:name":{}},"f:resourceClassName":{}}} }]},Spec:ResourceClaimSpec{ResourceClassName:gpu.nvidia.com,ParametersRef:&ResourceClaimParametersReference{APIGroup:gpu.resource.nvidia.com,Kind:MigDeviceClaimParameters,Name:mig-1g.10gb,},AllocationMode:WaitForFirstConsumer,},Status:ResourceClaimStatus{DriverName:,Allocation:nil,ReservedFor:[]ResourceClaimConsumerReference{},DeallocationRequested:false,},} Class:&ResourceClass{ObjectMeta:{gpu.nvidia.com c570a929-e0d7-40ec-8a0d-4d67fddd16d7 546 0 2023-12-04 14:29:06 +0000 UTC <nil> <nil> map[app.kubernetes.io/managed-by:Helm] map[meta.helm.sh/release-name:nvidia meta.helm.sh/release-namespace:nvidia-dra-driver] [] [] [{helm Update resource.k8s.io/v1alpha2 2023-12-04 14:29:06 +0000 UTC FieldsV1 {"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}} }]},DriverName:gpu.resource.nvidia.com,ParametersRef:nil,SuitableNodes:nil,} ClaimParameters:0xc000368f00 ClassParameters:0xc0005de098 UnsuitableNodes:[k8s-dra-driver-cluster-worker]}] selectedNode="k8s-dra-driver-cluster-worker"
You are trying to align your MIG device allocation with the allocation of a full GPU whose claim name is `mig-enabled-gpu`, yet you never create or reference a claim with this name. You have the `GpuClaimParameters` for it, but you never reference it in a claim, and you never request access to this claim in your pod.

If you actually don't care about ensuring MIG allocation on a specific GPU, just get rid of the `GpuClaimParameters` object for this full GPU, and remove the reference to `gpuClaimName: "mig-enabled-gpu"` in your `MigDeviceClaimParameters` object. Then a "random" MIG-enabled GPU will be picked and a MIG device will be allocated on it.

The purpose of `gpuClaimName: "mig-enabled-gpu"` is to ensure that multiple MIG devices are created on the same underlying GPU, rather than being arbitrarily spread out over many GPUs.
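For concreteness, the simpler of the two fixes above trims the `MigDeviceClaimParameters` down to just the profile (a sketch only, not a tested manifest): the `GpuClaimParameters` object is deleted entirely, and the `gpuClaimName` field is dropped.

```yaml
# Sketch of the simpler fix: no GpuClaimParameters object anywhere,
# and no gpuClaimName reference, so the driver is free to pick any
# MIG-enabled GPU and allocate the 1g.10gb device on it.
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  namespace: gpu-test1
  name: mig-1g.10gb
spec:
  profile: "1g.10gb"
```

The rest of the spec (the `ResourceClaimTemplate` and the pod's `resourceClaims` entry) stays as posted, since they already reference `mig-1g.10gb`.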
Thanks, @klueska , that worked!
Hello,
We see the logs below from the DRA controller, where it skips the node when allocating claims.
We are running the example below.
Any pointers are appreciated, thanks.