heoelri opened this issue 1 year ago
We just had something similar happen in the eastus2 region (also using K8s 1.24.6). Labels and taints were correctly applied by Terraform and visible in the Azure Portal, yet they were completely missing from the Kubernetes nodes themselves.
For an affected cluster, the Azure Resource browser (resource.azure.com) shows the expected node_labels and node_taints. This is, for me, an indicator that Terraform is setting the labels and taints correctly. Another indicator is of course that it only occurs in swedencentral (in our case; @eplightning reported a case in eastus2 as well) and not in other regions, and that it happens only randomly, not every night.
Triggering a reimage via az vmss reimage seems to fix nodes that are lacking the labels and taints.
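A minimal sketch of that workaround, in case it helps someone else; the resource group and scale set names below are placeholders, and you would first have to locate the scale set backing the affected node pool in the cluster's MC_* node resource group:

# List the scale sets behind the node pools (resource group name is a placeholder).
az vmss list --resource-group MC_myrg_mycluster_swedencentral --output table

# Reimage only the instance whose node is missing the labels/taints.
az vmss reimage \
  --resource-group MC_myrg_mycluster_swedencentral \
  --name aks-nodepool1-12345678-vmss \
  --instance-ids 1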
And we have had another instance of this bug, this time in northeurope, a region where we previously had no issues with this. Probably no region is safe; for some it's just more likely to happen.
Action required from @Azure/aks-pm
I have been observing the same issue in switzerlandnorth for a few days now. I have seen it with both K8s 1.24.6 and 1.24.3. I am creating a node pool using the Upbound Crossplane provider which, to the best of my knowledge, uses Terraform under the hood. The node labels that I requested show up in the Azure portal, again suggesting Terraform is doing what it is supposed to do, but when inspecting the nodes with kubectl describe nodes, the labels are missing. This issue happens fairly infrequently, but it is fairly catastrophic for our app when it does.
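A quicker way to spot this than reading the full kubectl describe nodes output is to print the relevant labels as columns for every node; the label keys below are only examples, substitute whatever custom labels the node pool is supposed to carry:

# Show the chosen labels as columns for all nodes; empty cells reveal affected nodes.
kubectl get nodes -L agentpool -L topology.kubernetes.io/zone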
We're still facing the same issue, this time in australiaeast.
Also facing this issue in westeurope at the moment
⚡ k describe node aks-generalpool-32674150-vmss000001
Name: aks-generalpool-32674150-vmss000001
Roles: <none>
Labels: agentpool=generalpool
beta.kubernetes.io/arch=arm64
beta.kubernetes.io/instance-type=Standard_D16ps_v5
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=westeurope
failure-domain.beta.kubernetes.io/zone=0
kubernetes.azure.com/mode=system
kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2arm64containerd-202409.30.0
kubernetes.azure.com/nodepool-type=VirtualMachineScaleSets
kubernetes.azure.com/os-sku=Ubuntu
kubernetes.azure.com/role=agent
kubernetes.azure.com/storageprofile=managed
kubernetes.azure.com/storagetier=Premium_LRS
kubernetes.io/arch=arm64
kubernetes.io/hostname=aks-generalpool-32674150-vmss000001
kubernetes.io/os=linux
node.kubernetes.io/instance-type=Standard_D16ps_v5
storageprofile=managed
storagetier=Premium_LRS
topology.disk.csi.azure.com/zone=
topology.kubernetes.io/region=westeurope
topology.kubernetes.io/zone=0
AKS is running version 1.29.x and the cluster was built with the Terraform provider v4.5.0, although I doubt it is a problem in Terraform.
For anyone hitting this issue: I discovered that by forcing the node pools into specific zones via Terraform, e.g. by setting the zones argument, the topology.kubernetes.io/zone labels are populated correctly 🎉
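A minimal sketch of what that looks like, assuming the node pool is managed through the azurerm provider's azurerm_kubernetes_cluster_node_pool resource; the resource names, VM size and node count below are placeholders rather than the reporter's actual configuration:

resource "azurerm_kubernetes_cluster_node_pool" "generalpool" {
  name                  = "generalpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id
  vm_size               = "Standard_D16ps_v5"
  node_count            = 3

  # Pin the pool to explicit availability zones so topology.kubernetes.io/zone
  # is populated with a real zone instead of "0".
  zones = ["1", "2", "3"]
}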
Describe the bug
We're deploying AKS with K8s 1.24.6 via Terraform on a nightly basis in swedencentral (and other regions) and have noticed multiple times already that the deployment in swedencentral fails due to missing custom labels and taints. The deployment is done via Terraform and is identical across all regions, yet only swedencentral is affected by this issue.

To Reproduce
Hard to reproduce, as this occurs on a seemingly random basis. Steps to reproduce the behavior:
node_labels and node_taints: https://github.com/Azure/Mission-Critical-Online/blob/ca62fa790d3900250991e93182461d98e4f2b657/src/infra/workload/releaseunit/modules/stamp/kubernetes.tf#L91-L97
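For context, a hedged sketch of how such custom node_labels and node_taints are typically declared on an azurerm node pool; this is an illustration with made-up names and values, not the exact configuration from the linked kubernetes.tf:

resource "azurerm_kubernetes_cluster_node_pool" "workload" {
  name                  = "workload"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.stamp.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 3

  # Custom labels and taints that every node in the pool should come up with.
  node_labels = {
    "role" = "workload"
  }
  node_taints = [
    "workload=true:NoSchedule"
  ]
}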
Expected behavior
All nodes are expected to consistently come up with the defined set of labels and taints.
Screenshots
The nodes in swedencentral sometimes (not always) lack the labels as well as the taints:

Environment (please complete the following information):
Additional context
Nothing.