Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Randomly missing node labels in `swedencentral` #3373

Open heoelri opened 1 year ago

heoelri commented 1 year ago

Describe the bug
We're deploying AKS (K8s 1.24.6) via Terraform on a nightly basis in swedencentral (and other regions) and have noticed multiple times that the deployment in swedencentral fails due to missing custom node labels and taints. The same Terraform configuration is deployed to all regions, yet only swedencentral is affected by this issue.

To Reproduce
Hard to reproduce, as this occurs on a seemingly random basis. Steps to reproduce the behavior:

  1. Deploy AKS (with K8s 1.24.6) via Terraform
  2. Deploy a user node pool with the following node_labels and node_taints:
node_labels = {
  "role" = "workload"
}

node_taints = [
   "workload=true:NoSchedule"
]

https://github.com/Azure/Mission-Critical-Online/blob/ca62fa790d3900250991e93182461d98e4f2b657/src/infra/workload/releaseunit/modules/stamp/kubernetes.tf#L91-L97
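For context, a minimal Terraform sketch of how those settings sit on the azurerm node pool resource (the cluster reference, node count and resource naming here are illustrative, not taken from the linked module):

resource "azurerm_kubernetes_cluster_node_pool" "workload" {
  name                  = "workloadpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.stamp.id   # hypothetical cluster reference
  vm_size               = "Standard_F8s_v2"
  node_count            = 3                                     # illustrative

  node_labels = {
    "role" = "workload"
  }

  node_taints = [
    "workload=true:NoSchedule"
  ]
}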

Expected behavior
All nodes consistently come up with the defined set of labels and taints.

Screenshots
The nodes in swedencentral sometimes (not always) lack the custom labels as well as the taints:

Name:               aks-workloadpool-14020145-vmss000001
Roles:              agent
Labels:             agentpool=workloadpool
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_F8s_v2
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=swedencentral
                    failure-domain.beta.kubernetes.io/zone=swedencentral-3
                    kubernetes.azure.com/agentpool=workloadpool
                    kubernetes.azure.com/cluster=MC_xxxx
                    kubernetes.azure.com/kubelet-identity-client-id=xxxx46be-xxxx-4d75-xxxx-a5d9b5fcea27
                    kubernetes.azure.com/mode=user
                    kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2containerd-2022.11.02
                    kubernetes.azure.com/os-sku=Ubuntu
                    kubernetes.azure.com/role=agent
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=aks-workloadpool-14020145-vmss000001
                    kubernetes.io/os=linux
                    kubernetes.io/role=agent
                    node-role.kubernetes.io/agent=
                    node.kubernetes.io/instance-type=Standard_F8s_v2
                    topology.disk.csi.azure.com/zone=swedencentral-3
                    topology.kubernetes.io/region=swedencentral
                    topology.kubernetes.io/zone=swedencentral-3
Taints:             <none>

Environment: not provided.

Additional context: none.

eplightning commented 1 year ago

We just had something similar happen in the eastus2 region (also using K8s 1.24.6).

Labels and taints were applied correctly by Terraform and are visible in the Azure Portal, yet they are completely missing from the Kubernetes nodes themselves.

heoelri commented 1 year ago

For an affected cluster, the Azure Resource Explorer (resources.azure.com) shows the expected node_labels and node_taints.

[Screenshot: node_labels and node_taints shown on the agent pool in the Azure Resource Explorer]

To me this is an indicator that Terraform is setting the labels and taints correctly. Another indicator is, of course, that in our case it only occurs in swedencentral (@eplightning reported a case in eastus2 as well) and not in other regions, and that it happens only randomly, not every night.

Triggering a reimage via `az vmss reimage` seems to fix nodes that are lacking the labels and taints.

eplightning commented 1 year ago

We have had another instance of this bug, this time in northeurope, a region where we previously had no issues with this.

Probably no region is safe; it's just more likely to happen in some.

ghost commented 1 year ago

Action required from @Azure/aks-pm

tdoublep commented 1 year ago

I have been observing the same issue in switzerlandnorth for a few days. I have seen it with both K8s 1.24.6 and 1.24.3. I am creating a node pool using the Upbound Crossplane provider, which, to the best of my knowledge, uses Terraform under the hood. The node labels that I requested show up in the Azure Portal, again suggesting Terraform is doing what it is supposed to do, but when inspecting the nodes with `kubectl describe nodes`, the labels are missing. This issue happens fairly infrequently, but it is fairly catastrophic for our app when it does.

heoelri commented 1 year ago

We're still facing the same issue, this time in australiaeast.

inge4pres commented 2 days ago

Also facing this issue in westeurope at the moment:

$ kubectl describe node aks-generalpool-32674150-vmss000001
Name:               aks-generalpool-32674150-vmss000001
Roles:              <none>
Labels:             agentpool=generalpool
                    beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/instance-type=Standard_D16ps_v5
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=westeurope
                    failure-domain.beta.kubernetes.io/zone=0
                    kubernetes.azure.com/mode=system
                    kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2arm64containerd-202409.30.0
                    kubernetes.azure.com/nodepool-type=VirtualMachineScaleSets
                    kubernetes.azure.com/os-sku=Ubuntu
                    kubernetes.azure.com/role=agent
                    kubernetes.azure.com/storageprofile=managed
                    kubernetes.azure.com/storagetier=Premium_LRS
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=aks-generalpool-32674150-vmss000001
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=Standard_D16ps_v5
                    storageprofile=managed
                    storagetier=Premium_LRS
                    topology.disk.csi.azure.com/zone=
                    topology.kubernetes.io/region=westeurope
                    topology.kubernetes.io/zone=0

AKS is running version 1.29.x and was built with Terraform provider v4.5.0, although I doubt the problem is in Terraform.

inge4pres commented 2 days ago

For anyone reaching this issue: I discovered that by forcing the node pools into specific zones via Terraform (i.e. setting the zones argument), the topology.kubernetes.io/zone labels are populated correctly 🎉
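A minimal sketch of that workaround, assuming the azurerm azurerm_kubernetes_cluster_node_pool resource (the cluster reference and pool values here are illustrative, not my actual configuration):

resource "azurerm_kubernetes_cluster_node_pool" "general" {
  name                  = "generalpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id   # hypothetical cluster reference
  vm_size               = "Standard_D16ps_v5"
  node_count            = 3                                    # illustrative

  # Pin the pool to explicit availability zones; with this set, the
  # topology.kubernetes.io/zone label came up populated correctly.
  zones = ["1", "2", "3"]
}

(The default_node_pool block on azurerm_kubernetes_cluster accepts the same zones argument.)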