What happened:
A pod was scheduled on a node. The customer increased the size of the PVC and also changed the pod's nodeSelector to move it to another worker pool. However, the Cluster Autoscaler (CA) could not trigger a scale-up in the new worker pool in the zone where the PV resides.
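For illustration only, a minimal sketch of the kind of change described above (the StatefulSet name is taken from the pod name in the logs; the pool label key, pool name, image, PVC name, and sizes are assumptions, not taken from the customer's manifests):

```
# Hypothetical manifests illustrating the change (names, labels, and sizes assumed):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki-new
spec:
  serviceName: loki-new
  replicas: 1
  selector:
    matchLabels:
      app: loki-new
  template:
    metadata:
      labels:
        app: loki-new
    spec:
      nodeSelector:
        worker.gardener.cloud/pool: observability-l   # changed to target the new worker pool
      containers:
      - name: loki
        image: grafana/loki:2.7.0                     # hypothetical image
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-loki-new-0   # hypothetical PVC name (StatefulSet volumeClaimTemplate naming)
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi         # increased from the original size
```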
CA logs:
```
{"log":"Pod loki-new-0 can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z3, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47
{"log":"Pod storagegateway-f47884d6f-kphrj can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z3, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47
{"log":"Pod loki-new-0 can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z2, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47
{"log":"Pod storagegateway-f47884d6f-kphrj can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z2, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47
{"log":"Pod loki-new-0 can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z1, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47
{"log":"Pod storagegateway-f47884d6f-kphrj can't be scheduled on shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z1, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=","pid":"1","severity":"INFO","source":"scale_up.go:300"}
2023-02-21 21:04:47
{"log":"Generating node template only using nodeTemplate from MachineClass shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z1-0fa0f: template resources-\u003e cpu: 8,memory: 64Gi","pid":"1","severity":"INFO","source":"mcm_manager.go:673"}
2023-02-21 21:04:47
{"log":"Generating node template only using nodeTemplate from MachineClass shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z3-0fa0f: template resources-\u003e cpu: 8,memory: 64Gi","pid":"1","severity":"INFO","source":"mcm_manager.go:673"}
2023-02-21 21:04:47
{"log":"Generating node template only using nodeTemplate from MachineClass shoot--hc-can-gc--sacdrgcp-dmi-observability-l-z2-0fa0f: template resources-\u003e cpu: 8,memory: 64Gi","pid":"1","severity":"INFO","source":"mcm_manager.go:673"}
```
PV's zone affinity:
```
k get pv pv--69c481e2-0961-48bc-9b88-e8f51f1c7113 -o yaml
...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.gke.io/zone
          operator: In
          values:
          - europe-west3-a
```
The entire worker pool was scaling from zero, so the only reference the CA had for building the node template was the MachineClass, hence the log lines above:
`Generating node template only using nodeTemplate from MachineClass`
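To make the failure mode concrete, here is a hedged sketch of the nodeTemplate the CA reads from the MachineClass for scale-from-zero. Only the cpu/memory values are taken from the log line above; the remaining fields and values are assumptions about the MachineClass layout, not a dump from the affected cluster:

```
# Rough sketch of the MachineClass nodeTemplate used for scale-from-zero;
# cpu/memory match the "template resources-> cpu: 8,memory: 64Gi" log line,
# everything else is assumed.
nodeTemplate:
  capacity:
    cpu: "8"
    memory: 64Gi
  instanceType: n1-standard-8   # hypothetical
  region: europe-west3          # assumed from the PV's zone
  zone: europe-west3-a          # assumed
# The node the CA simulates from this template apparently carries no
# topology.gke.io/zone label, so the PV's required nodeAffinity can never be
# satisfied and the VolumeBinding predicate rejects every candidate node group.
```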
What you expected to happen:
The autoscaler to trigger a scale-from-zero of the new worker pool in the PV's zone.
How to reproduce it (as minimally and precisely as possible):
Explained in the issue description
Anything else we need to know:
Seen when the GCP PD CSI driver is used; the driver is enabled on GCP from Kubernetes >= 1.18.
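For comparison, an illustrative excerpt (not taken from the affected cluster) of the labels a real node in that zone would carry once the GCP PD CSI driver has registered its topology:

```
# Illustrative node labels; the CSI driver's topology key is what the PV's
# nodeAffinity references, while the scale-from-zero template lacks it.
metadata:
  labels:
    topology.gke.io/zone: europe-west3-a        # registered via the GCP PD CSI driver
    topology.kubernetes.io/zone: europe-west3-a # standard zone label
    worker.gardener.cloud/pool: observability-l # assumed pool label
```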
Environment: Similar to #113; live issue #2626.