Integration tests successful; full log attached as ca-it.log.
make test-integration 2>&1 | tee /tmp/ca-it.log
....
Ran 13 of 13 Specs in 1334.708 seconds
SUCCESS! -- 13 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS
Shoot with 3 worker pools; only the 2nd worker pool, with minimum pool size specified as 0, declares the special dongle extended resource, as shown in the shoot YAML snippet below:
providerConfig:
  apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
  kind: WorkerConfig
  nodeTemplate:
    capacity:
      cpu: 8
      ephemeral-storage: 50Gi
      gpu: 0
      hana.hc.sap.com/hcu/cpu: 20
      hana.hc.sap.com/hcu/memory: 10
      memory: 7Gi
      resource.com/dongle: 4
As can be seen, due to Kubernetes restrictions, the extended resource must be specified identically in both the requests and limits sections of the Pod spec (extended resources cannot be overcommitted).
apiVersion: v1
kind: Pod
metadata:
  name: exrestest1
spec:
  containers:
  - name: example-container
    image: busybox
    command: ["sh", "-c", "echo Using extended resources && sleep 3600"]
    resources:
      limits:
        resource.com/dongle: "2"
      requests:
        resource.com/dongle: "2"
As can be seen below, the b node group (shoot--i034796--aw-b-z1), which declares the extended resource, was selected for scale-up (Final scale-up plan: [{shoot--i034796--aw-b-z1 0->1 (max: 3)}]), while the other node groups were rejected:
I1121 11:49:33.105893 58938 mcm_manager.go:791] Generating node template only using nodeTemplate from MachineClass shoot--i034796--aw-b-z1-f8c99: template resources-> cpu: 8,memory: 7Gi
I1121 11:49:33.105972 58938 mcm_manager.go:981] Copying extended resources map[hana.hc.sap.com/hcu/cpu:{{20 0} {<nil>} 20 DecimalSI} hana.hc.sap.com/hcu/memory:{{10 0} {<nil>} 10 DecimalSI} resource.com/dongle:{{4 0} {<nil>} 4 DecimalSI}] to template node.Status.Capacity
I1121 11:49:33.108659 58938 orchestrator.go:566] Pod default/exrestest1 can't be scheduled on shoot--i034796--aw-c-z1, predicate checking error: Insufficient resource.com/dongle; predicateName=NodeResourcesFit; reasons: Insufficient resource.com/dongle; debugInfo=
I1121 11:49:33.108794 58938 orchestrator.go:566] Pod default/exrestest1 can't be scheduled on shoot--i034796--aw-a-z1, predicate checking error: Insufficient resource.com/dongle; predicateName=NodeResourcesFit; reasons: Insufficient resource.com/dongle; debugInfo=
I1121 11:49:33.108967 58938 orchestrator.go:153] No pod can fit to shoot--i034796--aw-c-z1
I1121 11:49:33.108980 58938 orchestrator.go:153] No pod can fit to shoot--i034796--aw-a-z1
I1121 11:49:33.109082 58938 compare_nodegroups.go:164] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-c-z1-397814316375267006 are not similar, ephemeral-storage does not match
I1121 11:49:33.109113 58938 compare_nodegroups.go:164] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-a-z1-5270490573304173618 are not similar, cpu does not match
I1121 11:49:33.109747 58938 waste.go:55] Expanding Node Group shoot--i034796--aw-b-z1 would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1121 11:49:33.109783 58938 orchestrator.go:184] Best option to resize: shoot--i034796--aw-b-z1
I1121 11:49:33.109796 58938 orchestrator.go:188] Estimated 1 nodes needed in shoot--i034796--aw-b-z1
I1121 11:49:33.109832 58938 compare_nodegroups.go:164] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-a-z1-5270490573304173618 are not similar, cpu does not match
I1121 11:49:33.109892 58938 compare_nodegroups.go:150] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-c-z1-397814316375267006 are not similar, missing capacity hana.hc.sap.com/hcu/memory
I1121 11:49:33.109911 58938 orchestrator.go:215] No similar node groups found
I1121 11:49:33.109931 58938 orchestrator.go:257] Final scale-up plan: [{shoot--i034796--aw-b-z1 0->1 (max: 3)}]
I1121 11:49:33.110018 58938 executor.go:147] Scale-up: setting group shoot--i034796--aw-b-z1 size to 1
I1121 11:49:33.110128 58938 mcm_cloud_provider.go:351] Received request to increase size of machine deployment shoot--i034796--aw-b-z1 by 1
I1121 11:49:33.110321 58938 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"2b3bbc7e-927e-48fb-b3cd-1522eb3148e5", APIVersion:"v1", ResourceVersion:"7204541", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group shoot--i034796--aw-b-z1 size to 1 instead of 0 (max: 3)
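To illustrate the mechanism behind the "Copying extended resources ... to template node.Status.Capacity" log line above, here is a minimal Go sketch of how extended resources declared in the MachineClass nodeTemplate capacity can be merged into the synthetic node used for scale-from-zero simulation. The names, the split between "standard" and extended resources, and the function itself are illustrative assumptions, not the actual mcm_manager.go implementation.

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// standardResources are assumed to be handled by the regular template-building path
// (see the "template resources-> cpu: 8,memory: 7Gi" log line above); everything else
// declared in the MachineClass capacity is treated as an extended resource.
var standardResources = map[corev1.ResourceName]struct{}{
	corev1.ResourceCPU:              {},
	corev1.ResourceMemory:           {},
	corev1.ResourceEphemeralStorage: {},
	corev1.ResourcePods:             {},
	"gpu":                           {},
}

// copyExtendedResources copies extended resources from the MachineClass nodeTemplate
// capacity into the template node's Status.Capacity so that the scheduler simulation
// (NodeResourcesFit) can match pods requesting e.g. resource.com/dongle.
func copyExtendedResources(machineClassCapacity corev1.ResourceList, node *corev1.Node) {
	if node.Status.Capacity == nil {
		node.Status.Capacity = corev1.ResourceList{}
	}
	for name, quantity := range machineClassCapacity {
		if _, isStandard := standardResources[name]; isStandard {
			continue
		}
		node.Status.Capacity[name] = quantity
	}
}

With the extended resources present on the template node, a pending pod requesting resource.com/dongle can only fit the pool that declares it, which is why the other two node groups are rejected with "Insufficient resource.com/dongle".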
As can be seen below, the Pod declaring the extended resource is still unscheduled after scale-up. This is expected: a controller or policy must explicitly add the extended resource to Node.Status.Capacity, as described in the Kubernetes documentation under Advertise Extended Resources for a Node.
k get po
NAME READY STATUS RESTARTS AGE
exrestest1 0/1 Pending 0 10m
Currently, stakeholders do this with Kyverno policies that update the scaled Node objects with the extended resources, but this is inconvenient and somewhat error prone, since the values in the CA node group and the Kyverno policy can diverge. On request, we created an additional feature enhancement ticket on the gardener MCM at https://github.com/gardener/machine-controller-manager/issues/955, which will be delivered in the future.
NOTE: This step is normally performed by the policy; here it is replicated manually for testing.
kubectl proxy
curl --header "Content-Type: application/json-patch+json" --request PATCH --data '[{"op": "add", "path": "/status/capacity/resource.com~1dongle", "value": "4"}]' http://localhost:8001/api/v1/nodes/ip-10-180-31-22.eu-west-1.compute.internal/status
After the patch, the Pod is scheduled and transitions to Running:
49s Normal Scheduled pod/exrestest1 Successfully assigned default/exrestest1 to ip-10-180-31-22.eu-west-1.compute.internal
NAME READY STATUS RESTARTS AGE
exrestest1 1/1 Running 0 22m
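Instead of the manual kubectl proxy + curl patch shown above, a controller could apply the same JSON patch to the node's status subresource with client-go. The following is a minimal sketch under that assumption (node name and resource value taken from the example above); it is not part of this PR.

package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// advertiseDongle patches the node's status subresource to advertise the
// resource.com/dongle extended resource, equivalent to the curl call above.
// The "~1" in the path is the JSON Pointer escape for "/" in the resource name.
func advertiseDongle(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(`[{"op": "add", "path": "/status/capacity/resource.com~1dongle", "value": "4"}]`)
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.JSONPatchType, patch,
		metav1.PatchOptions{}, "status")
	return err
}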
What this PR does / why we need it:
Right now, in scale-from-zero cases, the autoscaler does not respect the ephemeral-storage and extended resources specified in the MachineClass.NodeTemplate. Only the standard cpu, gpu and memory are picked up; custom extended resources are currently ignored entirely.
Which issue(s) this PR fixes: Fixes #132
Special notes for your reviewer:
Release note: