Open tepley opened 1 year ago
Good-to-have feature, and critical for workloads that need time to stabilize, e.g. ZooKeeper and Kafka.
I think this is a useful feature. Nodes might come up, but anything that's clustered still needs time to balance, process, vote, etc.
As a workaround we are using pod disruption budgets in situations that need this.
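For anyone else trying this workaround, here is a minimal sketch of what such a budget could look like, assuming the upgrade drains nodes through the eviction API (which honors pod disruption budgets). The helper `build_pdb`, the name `kafka-pdb`, and the label `app: kafka` are hypothetical placeholders, not taken from any real cluster. The script only builds a `policy/v1` PodDisruptionBudget manifest and prints it as JSON.

```python
import json


def build_pdb(name, app_label, max_unavailable=1, namespace="default"):
    """Return a policy/v1 PodDisruptionBudget manifest as a plain dict.

    All arguments are illustrative; adjust them to match your own workload.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Allow at most this many matching pods to be disrupted at once.
            "maxUnavailable": max_unavailable,
            # Hypothetical label; must match the labels on the workload's pods.
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be reviewed or applied.
    print(json.dumps(build_pdb("kafka-pdb", "kafka"), indent=2))
```

The output can be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests. Note that a budget only counts pods that report Ready, so it slows the drain only as far as the readiness probes reflect real recovery (e.g. the broker has rejoined the cluster, not just started).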
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Is your feature request related to a problem? Please describe. Whenever we upgrade an AKS cluster that has a Windows node pool to a new Kubernetes version, the upgrade replaces nodes faster than the pods can recover. This usually results in a complete outage of the entire cluster for about 20 minutes in production, which is very impactful and hard to work around.
Describe the solution you'd like I would like a way to build in a wait before the upgrade moves on to the next node, so that the pods rescheduled from the previous node have time to recover. This really only affects Windows images on Windows nodes, as the delay is mostly due to the size of the image itself.
Describe alternatives you've considered We have reviewed the node surge upgrade settings, but as far as we can see the defaults are already the slowest option. If we raise the surge value, the upgrade just becomes more aggressive, taking more nodes down the moment the previous nodes are healthy from a Kubernetes perspective.