gardener / autoscaler

Customised fork of cluster-autoscaler to support machine-controller-manager
Apache License 2.0

Scale from Zero doesn't work in GCP for k8s >=1.18 in case of pods with PVs #182

Closed himanshu-kun closed 1 year ago

himanshu-kun commented 1 year ago

What happened: A pod was scheduled on a node. The customer increased the size of the PVC and also changed the pod's nodeSelector to move it to another worker pool. However, the CA could not trigger a scale-up in the new worker pool in the zone where the PV resides.

CA logs

{"log":"Pod loki-new-0 can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z3, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47 
{"log":"Pod storagegateway-f47884d6f-kphrj can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z3, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47 
{"log":"Pod loki-new-0 can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z2, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47 
{"log":"Pod storagegateway-f47884d6f-kphrj can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z2, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47 
{"log":"Pod loki-new-0 can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z1, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47 
{"log":"Pod storagegateway-f47884d6f-kphrj can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z1, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47 
{"log":"Generating node template only using nodeTemplate from MachineClass shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z1-0fa0f: template resources-\u003e cpu: 8,memory: 64Gi","pid":"1","severity":"INFO","source":"mcm_manager.go:673"}
2023-02-21 21:04:47 
{"log":"Generating node template only using nodeTemplate from MachineClass shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z3-0fa0f: template resources-\u003e cpu: 8,memory: 64Gi","pid":"1","severity":"INFO","source":"mcm_manager.go:673"}
2023-02-21 21:04:47 
{"log":"Generating node template only using nodeTemplate from MachineClass shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z2-0fa0f: template resources-\u003e cpu: 8,memory: 64Gi","pid":"1","severity":"INFO","source":"mcm_manager.go:673"}

PV's zone affinity

k get pv pv--69c481e2-0961-48bc-9b88-e8f51f1c7113 -o yaml
...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.gke.io/zone
          operator: In
          values:
          - europe-west3-a

The entire worker pool was scaling from zero, so the only reference the CA had for forming the node template was the MachineClass:

Generating node template only using nodeTemplate from MachineClass
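
The conflict reported in the logs can be illustrated with a minimal Go sketch. It is not the CA's or the scheduler's actual code: the `requirement` type and `matchesAll` helper are hypothetical, and the label sets are assumptions made for illustration (the issue does not list which labels the MachineClass-derived template carries, nor which zone each pool maps to). It only shows why a template that lacks the `topology.gke.io/zone` label cannot satisfy the PV's node affinity shown above.

```go
package main

import "fmt"

// requirement mirrors one matchExpressions entry of a PV's nodeAffinity,
// restricted to the "In" operator used by the PV in this issue.
// Hypothetical type, for illustration only.
type requirement struct {
	key    string
	values []string
}

// contains reports whether v is one of vals.
func contains(vals []string, v string) bool {
	for _, x := range vals {
		if x == v {
			return true
		}
	}
	return false
}

// matchesAll reports whether the node labels satisfy every requirement.
// A missing label key counts as a mismatch, which is what produces a
// "volume node affinity conflict" for a template without the zone label.
func matchesAll(labels map[string]string, reqs []requirement) bool {
	for _, r := range reqs {
		v, ok := labels[r.key]
		if !ok || !contains(r.values, v) {
			return false
		}
	}
	return true
}

func main() {
	// Taken from the PV's nodeAffinity above.
	pvAffinity := []requirement{
		{key: "topology.gke.io/zone", values: []string{"europe-west3-a"}},
	}

	// ASSUMPTION: a template built only from the MachineClass carries the
	// generic zone label but not the CSI driver's topology label.
	templateFromMachineClass := map[string]string{
		"topology.kubernetes.io/zone": "europe-west3-a",
	}

	// ASSUMPTION: a registered node in the same zone also carries the
	// label that the GCP PD CSI driver uses for topology.
	nodeWithCSILabel := map[string]string{
		"topology.kubernetes.io/zone": "europe-west3-a",
		"topology.gke.io/zone":        "europe-west3-a",
	}

	fmt.Println(matchesAll(templateFromMachineClass, pvAffinity)) // false
	fmt.Println(matchesAll(nodeWithCSILabel, pvAffinity))         // true
}
```

Under these assumptions, the first check fails even though the template is in the right zone, which matches the "volume node affinity conflict" reported for every zone in the CA logs.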

What you expected to happen: Autoscaler to trigger scale-from-zero

How to reproduce it (as minimally and precisely as possible): Explained in the issue description

Anything else we need to know: Seen when the GCP PD CSI driver is used, which is enabled on GCP for k8s >=1.18.

Environment: Similar to #113 live issue # 2626

elankath commented 1 year ago

picked this up

himanshu-kun commented 1 year ago

We also need to adapt this integration test (IT): https://github.com/gardener/autoscaler/blob/5c55843a80068a67a1793fc3d418f4569eb8ad1b/cluster-autoscaler/integration/integration_test.go#L258-L259

himanshu-kun commented 1 year ago

/milestone v1.27, as it'll go into CA release v1.27.0