ClusterAutoscaler does not back off fast enough from ResourceExhausted errors in some situations

dongyingbo commented 1 month ago

What happened: Cluster creation was slow when default (m6i.2xlarge) was out of capacity, even though we had other nodegroups that allow system components in the meantime.

We have default (m6i.2xlarge) enabled in zones eu-central-1a, eu-central-1b and eu-central-1c. The logs showed that it was out of capacity in both zoneA and zoneC. However, CA (Cluster Autoscaler) marked the nodegroup in zoneC unhealthy quickly, but it had still not marked the nodegroup in zoneA unhealthy after more than 20 minutes. What looks suspicious to me is that, for the nodegroup in zoneA, I saw many log lines like this one:

{"log":"Error while trying to delete nodes from shoot--hc-dev--i502777-2-orc-default-z1: MachineDeployment shoot--hc-dev--myshoot-2-orc-default-z1 is under rolling update , cannot reduce replica count","pid":"1","severity":"WARN","source":"static_autoscaler.go:898"}

I did not see similar log lines for the nodegroup in zoneC.
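To compare the two zones more systematically, the "under rolling update" warnings can be counted per MachineDeployment. The following is only a minimal sketch in Go: it assumes the autoscaler logs are available as JSON lines with the `log`, `severity` and `source` fields seen in the line quoted above, and the names `logEntry` and `countRollingUpdateWarnings` are hypothetical helpers, not part of the autoscaler.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// logEntry mirrors the fields visible in the quoted log line.
// Any other fields in the real logs are ignored (assumption).
type logEntry struct {
	Log      string `json:"log"`
	Severity string `json:"severity"`
	Source   string `json:"source"`
}

// rollingUpdateRe extracts the MachineDeployment name from the warning text.
var rollingUpdateRe = regexp.MustCompile(`MachineDeployment (\S+) is under rolling update`)

// countRollingUpdateWarnings returns, per MachineDeployment, how many
// "under rolling update" messages appear in the given JSON-lines log text.
func countRollingUpdateWarnings(logs string) map[string]int {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		var e logEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue // skip lines that are not valid JSON
		}
		if m := rollingUpdateRe.FindStringSubmatch(e.Log); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// The sample is the log line quoted in this issue.
	sample := `{"log":"Error while trying to delete nodes from shoot--hc-dev--i502777-2-orc-default-z1: MachineDeployment shoot--hc-dev--myshoot-2-orc-default-z1 is under rolling update , cannot reduce replica count","pid":"1","severity":"WARN","source":"static_autoscaler.go:898"}`
	for name, n := range countRollingUpdateWarnings(sample) {
		fmt.Printf("%s: %d\n", name, n)
	}
}
```

If the zoneA MachineDeployment shows a high count while the zoneC one shows none, that would support the suspicion that the rolling-update state is related to the delayed backoff.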

What you expected to happen: The nodegroup should be backed off quickly on a ResourceExhausted error in every situation.

How to reproduce it (as minimally and precisely as possible): There is no easy way to simulate a node type being out of capacity.

Anything else we need to know: N/A

Environment: N/A

dongyingbo commented 1 month ago

Is this something that could be improved by the new flags planned in https://github.com/gardener/autoscaler/issues/176?

dongyingbo commented 1 month ago

Closing, as I cannot provide detailed logs for now.

gardener/autoscaler #330