gardener / autoscaler

Customised fork of cluster-autoscaler to support machine-controller-manager
Apache License 2.0

`[Cherry-pick rel-v1.30]` Support extended resources and ephemeral-storage for scale-from-zero specified in MachineClass NodeTemplate #336

Closed: elankath closed this 4 days ago

elankath commented 5 days ago

What this PR does / why we need it:

Right now, in scale-from-zero cases the autoscaler does not respect the ephemeral-storage and extended resources specified in the MachineClass.NodeTemplate; only the standard cpu, gpu and memory values are picked up. Custom extended resources are currently ignored entirely.
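
For reference, a minimal sketch of the kind of MachineClass nodeTemplate this change starts honouring; the names, instance type and values below are illustrative and not taken from this PR:

apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineClass
metadata:
  name: example-machine-class        # illustrative name
  namespace: example-namespace
provider: AWS
nodeTemplate:
  capacity:
    cpu: "8"
    memory: 7Gi
    ephemeral-storage: 50Gi          # now picked up for scale-from-zero
    resource.com/dongle: "4"         # custom extended resource, now picked up as well
  instanceType: m5.2xlarge
  region: eu-west-1
  zone: eu-west-1a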

Which issue(s) this PR fixes: Fixes #132

Special notes for your reviewer:

Release note:

Support extended resources and ephemeral-storage for scale-from-zero specified in MachineClass NodeTemplate
elankath commented 5 days ago

Integration tests successful: ca-it.log

make test-integration 2>&1 | tee /tmp/ca-it.log
....
Ran 13 of 13 Specs in 1334.708 seconds
SUCCESS! -- 13 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS
elankath commented 4 days ago

Test Case

Shoot Setup

Shoot with 3 worker pools. Only the 2nd worker pool, which has a minimum pool size of 0, declares the special dongle resource, as shown in the shoot YAML snippet below:

        providerConfig:
          apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
          kind: WorkerConfig
          nodeTemplate:
            capacity:
              cpu: 8
              ephemeral-storage: 50Gi
              gpu: 0
              hana.hc.sap.com/hcu/cpu: 20
              hana.hc.sap.com/hcu/memory: 10
              memory: 7Gi
              resource.com/dongle: 4

Pod with Extended Resource

As can be seen below, due to Kubernetes restrictions the extended resource must be declared in both the requests and limits sections.

apiVersion: v1
kind: Pod
metadata:
  name: exrestest1
spec:
  containers:
    - name: example-container
      image: busybox
      command: ["sh", "-c", "echo Using extended resources && sleep 3600"]
      resources:
        limits:
          resource.com/dongle: "2"
        requests:
          resource.com/dongle: "2"

CA Scale Up

As can be seen below, the b NodeGroup, which declares the extended resource, was selected for scale-up (Final scale-up plan: [{shoot--i034796--aw-b-z1 0->1 (max: 3)}]); the other node groups were rejected.

CA Scale Up Log

I1121 11:49:33.105893   58938 mcm_manager.go:791] Generating node template only using nodeTemplate from MachineClass shoot--i034796--aw-b-z1-f8c99: template resources-> cpu: 8,memory: 7Gi
I1121 11:49:33.105972   58938 mcm_manager.go:981] Copying extended resources map[hana.hc.sap.com/hcu/cpu:{{20 0} {<nil>} 20 DecimalSI} hana.hc.sap.com/hcu/memory:{{10 0} {<nil>} 10 DecimalSI} resource.com/dongle:{{4 0} {<nil>} 4 DecimalSI}] to template node.Status.Capacity
I1121 11:49:33.108659   58938 orchestrator.go:566] Pod default/exrestest1 can't be scheduled on shoot--i034796--aw-c-z1, predicate checking error: Insufficient resource.com/dongle; predicateName=NodeResourcesFit; reasons: Insufficient resource.com/dongle; debugInfo=
I1121 11:49:33.108794   58938 orchestrator.go:566] Pod default/exrestest1 can't be scheduled on shoot--i034796--aw-a-z1, predicate checking error: Insufficient resource.com/dongle; predicateName=NodeResourcesFit; reasons: Insufficient resource.com/dongle; debugInfo=
I1121 11:49:33.108967   58938 orchestrator.go:153] No pod can fit to shoot--i034796--aw-c-z1
I1121 11:49:33.108980   58938 orchestrator.go:153] No pod can fit to shoot--i034796--aw-a-z1
I1121 11:49:33.109082   58938 compare_nodegroups.go:164] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-c-z1-397814316375267006 are not similar, ephemeral-storage does not match
I1121 11:49:33.109113   58938 compare_nodegroups.go:164] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-a-z1-5270490573304173618 are not similar, cpu does not match
I1121 11:49:33.109747   58938 waste.go:55] Expanding Node Group shoot--i034796--aw-b-z1 would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1121 11:49:33.109783   58938 orchestrator.go:184] Best option to resize: shoot--i034796--aw-b-z1
I1121 11:49:33.109796   58938 orchestrator.go:188] Estimated 1 nodes needed in shoot--i034796--aw-b-z1
I1121 11:49:33.109832   58938 compare_nodegroups.go:164] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-a-z1-5270490573304173618 are not similar, cpu does not match
I1121 11:49:33.109892   58938 compare_nodegroups.go:150] nodes template-node-for-shoot--i034796--aw-b-z1-6493561263961564664 and template-node-for-shoot--i034796--aw-c-z1-397814316375267006 are not similar, missing capacity hana.hc.sap.com/hcu/memory
I1121 11:49:33.109911   58938 orchestrator.go:215] No similar node groups found
I1121 11:49:33.109931   58938 orchestrator.go:257] Final scale-up plan: [{shoot--i034796--aw-b-z1 0->1 (max: 3)}]
I1121 11:49:33.110018   58938 executor.go:147] Scale-up: setting group shoot--i034796--aw-b-z1 size to 1
I1121 11:49:33.110128   58938 mcm_cloud_provider.go:351] Received request to increase size of machine deployment shoot--i034796--aw-b-z1 by 1
I1121 11:49:33.110321   58938 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"2b3bbc7e-927e-48fb-b3cd-1522eb3148e5", APIVersion:"v1", ResourceVersion:"7204541", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group shoot--i034796--aw-b-z1 size to 1 instead of 0 (max: 3)

Pod Assignment

As can be seen, the Pod declaring the extended resource is still unassigned after scale-up. This is because a controller or policy is expected to explicitly add the extended resource to Node.Status.Capacity, as described in the Kubernetes documentation under "Advertise Extended Resources for a Node".

k get po
NAME         READY   STATUS    RESTARTS   AGE
exrestest1   0/1     Pending   0          10m
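
One way to confirm why the pod stays Pending is to inspect the advertised capacity of the freshly scaled node (the node name is taken from the patch step further below; the jsonpath query is just one possible way to check):

kubectl get node ip-10-180-31-22.eu-west-1.compute.internal -o jsonpath='{.status.capacity}'

At this point resource.com/dongle is absent from the capacity map, so the scheduler cannot place the pod.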

Currently, stakeholders do this with Kyverno policies that update the scaled Node objects with the extended resources, but this is inconvenient and somewhat error prone because the values in the CA node group and the Kyverno policy can diverge. On request, we created an additional feature enhancement ticket on the gardener MCM at https://github.com/gardener/machine-controller-manager/issues/955, which will be delivered in the future.

Update Scaled Node with Extended Resources

NOTE: This step is normally performed by a policy; it is replicated manually here for testing.

Proxy the API server

kubectl proxy

Patch extended resource

 curl --header "Content-Type: application/json-patch+json" --request PATCH --data '[{"op": "add", "path": "/status/capacity/resource.com~1dongle", "value": "4"}]' http://localhost:8001/api/v1/nodes/ip-10-180-31-22.eu-west-1.compute.internal/status
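
Note that ~1 in the patch path is the JSON Pointer escape for /, so the path resolves to the resource.com/dongle key under status.capacity. After the patch, the node's status.capacity should include the dongle entry alongside the values reported by the kubelet, roughly:

status:
  capacity:
    resource.com/dongle: "4"
    # ... other capacity entries (cpu, memory, ephemeral-storage, pods) reported by the kubelet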

Pod is now Scheduled.

49s         Normal    Scheduled                           pod/exrestest1                                    Successfully assigned default/exrestest1 to ip-10-180-31-22.eu-west-1.compute.internal

NAME         READY   STATUS    RESTARTS   AGE
exrestest1   1/1     Running   0          22m