Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] nodepool upgrade not waiting for temporary node to be ready #4340

Open robindv opened 5 months ago

robindv commented 5 months ago

Describe the bug
I recently upgraded three AKS clusters from 1.29.2 to 1.29.4 and noticed that the temporary node added to the pool before the upgrade only becomes ready after the first node has already been removed from the cluster.

In the past, the upgrade only began once the extra temporary node was ready and free of taints. Because of this changed behavior, there is a window (~one minute) during which a single-node nodepool has no available node at all.

To Reproduce
Steps to reproduce the behavior:

  1. Upgrade a node pool from 1.29.2 to 1.29.4 using the Azure Portal
  2. Watch the nodes using a tool like k9s (or with plain kubectl, as in the sketch below)
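
For anyone who prefers plain kubectl over k9s, a minimal way to watch the same thing (the custom-columns spec is just one illustrative choice, not anything AKS-specific):

kubectl get nodes -w
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

The first command streams readiness transitions; the second shows which taints are currently on each node.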

Expected behavior
The temporary node is added to the nodepool and becomes ready, without any taints, before the upgrade of the other nodes starts.

Environment (please complete the following information):

paulgmiller commented 2 months ago

In my institutional memory we've never actually waited on nodes to be ready, because readiness can flap. Instead we wait for CSE to return, which just guarantees node registration. But we may have made this more visible/variable by not having nodes mark themselves as ready until the CNI is actually ready to process traffic.
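
(To see why a node is still reporting NotReady at that point, the Ready condition's message usually spells it out; this is plain kubectl, with the node name as a placeholder:

kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'

On a node whose CNI hasn't come up yet, that message typically mentions the container runtime network not being ready.)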

Checking what kubectl get nodes -w looks like with kubenet vs overlay with @tyler-lloyd, for fun.

paulgmiller commented 2 months ago

Here's a pretty boring vanilla cluster. Are you using anything interesting on your nodes or in your network setup?

-> % k get nodes -w | /usr/bin/ts

Sep 10 06:11:31 NAME STATUS ROLES AGE VERSION
Sep 10 06:11:31 aks-nodepool1-26445000-vmss000000 Ready 9m8s v1.29.7
Sep 10 06:11:31 aks-nodepool1-26445000-vmss000001 Ready 9m12s v1.29.7
Sep 10 06:11:31 aks-nodepool1-26445000-vmss000002 Ready 9m2s v1.29.7
Sep 10 06:11:42 aks-nodepool1-26445000-vmss000002 Ready 9m13s v1.29.7
Sep 10 06:12:01 aks-nodepool1-26445000-vmss000000 Ready 9m38s v1.29.7
Sep 10 06:12:01 aks-nodepool1-26445000-vmss000001 Ready 9m42s v1.29.7
Sep 10 06:12:01 aks-nodepool1-26445000-vmss000002 Ready 9m32s v1.29.7
Sep 10 06:12:21 aks-nodepool1-26445000-vmss000001 Ready 10m v1.29.7
Sep 10 06:13:05 aks-nodepool1-26445000-vmss000003 NotReady 0s v1.30.3
Sep 10 06:13:05 aks-nodepool1-26445000-vmss000003 NotReady 0s v1.30.3
Sep 10 06:13:05 aks-nodepool1-26445000-vmss000003 NotReady 0s v1.30.3
Sep 10 06:13:05 aks-nodepool1-26445000-vmss000003 NotReady 0s v1.30.3
Sep 10 06:13:06 aks-nodepool1-26445000-vmss000003 NotReady 1s v1.30.3
Sep 10 06:13:06 aks-nodepool1-26445000-vmss000003 Ready 1s v1.30.3
Sep 10 06:13:06 aks-nodepool1-26445000-vmss000003 Ready 1s v1.30.3
Sep 10 06:13:08 aks-nodepool1-26445000-vmss000003 Ready 3s v1.30.3
Sep 10 06:13:09 aks-nodepool1-26445000-vmss000003 Ready 4s v1.30.3
Sep 10 06:13:11 aks-nodepool1-26445000-vmss000003 Ready 6s v1.30.3
Sep 10 06:13:24 aks-nodepool1-26445000-vmss000000 Ready,SchedulingDisabled 11m v1.29.7
Sep 10 06:13:24 aks-nodepool1-26445000-vmss000000 Ready,SchedulingDisabled 11m v1.29.7
Sep 10 06:13:28 aks-nodepool1-26445000-vmss000003 Ready 23s v1.30.3
Sep 10 06:13:28 aks-nodepool1-26445000-vmss000003 Ready 23s v1.30.3

microsoft-github-policy-service[bot] commented 1 month ago

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

paulgmiller commented 3 weeks ago

Had another customer bring this up, with regard to the taint node.cloudprovider.kubernetes.io/uninitialized still being on replacement/surge nodes when the old node was drained during an upgrade.
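
(A quick way to spot-check whether a surge node still carries that taint, with the node name as a placeholder:

kubectl describe node <surge-node-name> | grep -A3 Taints:

Any line there still listing node.cloudprovider.kubernetes.io/uninitialized means the cloud provider hasn't initialized the node yet.)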

AKS will never be perfect in this regard, as readiness and taints are dynamic and can change at any time. Your best line of defense for critical applications is to define pod disruption budgets, as those will block drains/evictions regardless of why the new pod can't come up, whether it's due to the node, the pod itself, or something else.
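
(For anyone who hasn't used one before, a minimal PDB can be created imperatively; the name and label selector below are placeholders for your own workload:

kubectl create poddisruptionbudget my-app-pdb --selector=app=my-app --min-available=1

With that in place, drains/evictions that would take the selected pods below min-available are refused until replacement pods are up. Note that a single-replica workload with min-available=1 will block the drain entirely, so you generally want at least two replicas.)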

AKS could be better here, though. We could start with an allowlisted set of node conditions and taints that we know are likely to occur at startup, even if only briefly, and wait some amount of time T for them all to clear once (ignoring whether they come back).
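
(A rough sketch of what that check could look like, written as a shell loop; the node name, the allowlist, and the timeout T are all illustrative, not anything AKS actually does today:

NODE=aks-nodepool1-26445000-vmss000003     # example surge node name from the log above
REMAINING="node.cloudprovider.kubernetes.io/uninitialized node.kubernetes.io/not-ready"
SEEN_READY=false
for i in $(seq 1 60); do                   # T = 60 polls * 5s = 5 minutes
  taints=$(kubectl get node "$NODE" -o jsonpath='{.spec.taints[*].key}')
  next=""
  for t in $REMAINING; do
    case " $taints " in *" $t "*) next="$next$t " ;; esac   # still present: keep waiting on it
  done
  REMAINING=$next
  ready=$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$ready" = "True" ] && SEEN_READY=true
  # done once every allowlisted taint has been absent at least once and Ready was seen True once
  [ -z "$REMAINING" ] && [ "$SEEN_READY" = true ] && break
  sleep 5
done

Each allowlisted taint only has to clear once; if it comes back later it is ignored, matching the idea above.)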

Using all conditions/taints is hard, as some may flap and some may be applied by the customer.

Generally we haven't invested in this because we don't see it that often (though we're looking into the data there), it's not trivial to orchestrate, and you probably want PDBs anyway.

Please upvote if you'd like to see improvements in this area.

robindv commented 3 weeks ago

Thanks for the suggestion, I'll have a closer look at pod disruption budgets to arm myself against this behaviour :-)