[Feature] Allow a configurable delay between node upgrades to ensure pods have enough time to recover during upgrades.

tepley commented 1 year ago

Is your feature request related to a problem? Please describe. Whenever we upgrade an AKS cluster that has a Windows node pool to a new Kubernetes version, the upgrade replaces nodes faster than the pods can recover. This usually results in a complete outage of the entire cluster for 20 minutes in production, which is very impactful and hard to work around.

Describe the solution you'd like I would like a way to configure a wait before the upgrade moves on to the next node, so that the pods from the previous node can recover. This really only affects Windows images on Windows nodes, as the delay is mostly due to the size of the image itself.

Describe alternatives you've considered We have reviewed the node surge upgrade settings, but the default is already the slowest option we can find. If we raise it, the upgrade will just be more aggressive, taking more nodes down the moment the previous nodes are healthy from a Kubernetes perspective.

valencetech commented 1 year ago

Good-to-have feature, critical for workloads that require time to stabilize, e.g. ZooKeeper, Kafka.

bbgobie commented 1 year ago

I think this is a useful feature. Nodes might come up, but anything that's clustered needs time to balance, process, vote, etc.

tepley commented 1 year ago

As a workaround we are using pod disruption budgets in situations that need this.
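
For anyone landing here, a minimal sketch of what such a budget could look like, not the exact configuration used above. It builds a `policy/v1` PodDisruptionBudget manifest as JSON, which `kubectl apply -f` accepts. The name `web-pdb`, the label `app: web`, and the `maxUnavailable` value of 1 are hypothetical placeholders to adapt to your own workload.

```python
import json


def build_pdb(name, app_label, max_unavailable=1):
    """Return a policy/v1 PodDisruptionBudget manifest as a plain dict.

    `name` and `app_label` are hypothetical placeholders; the selector
    must match the labels on the pods you want to protect.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # At most this many matching pods may be down due to
            # voluntary disruptions (such as a node drain) at one time.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # Print the manifest; save it to a file and apply it with kubectl.
    print(json.dumps(build_pdb("web-pdb", "web"), indent=2))
```

Because a budget counts only Ready pods as healthy, the drain of the next node should be held back until the replacement pods pass their readiness probes, so this only helps if those probes reflect real recovery (image pulled, cluster member rejoined, and so on).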