Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Deallocated nodes are being reported as "Unhealthy" #4313

Open JoeyC-Dev opened 6 months ago

JoeyC-Dev commented 6 months ago

Describe the bug

When a node pool's --scale-down-mode is set to Deallocate and the pool is then scaled in by 1 node, the Azure portal reports the cluster as "Unhealthy".

To Reproduce

# Basic parameters for setup (set aks and rG before running)
aks=
rG=
location=southeastasia

# Create resource group
az group create -n ${rG} -l ${location}

# Set `--node-os-upgrade-channel` to None to avoid being unable to scale or change node pool properties later
az aks create -n ${aks} -g ${rG} \
--nodepool-name agentpool \
--node-count 2 \
--node-os-upgrade-channel None \
--no-ssh-key

# Set the scale-down mode to Deallocate to reproduce the issue
az aks nodepool update --name agentpool --cluster-name ${aks} -g ${rG} \
--scale-down-mode Deallocate

# Reproduce the issue by scaling in one node
az aks nodepool scale --name agentpool --cluster-name ${aks} -g ${rG} \
--node-count 1

# Give AKS time to finish processing
sleep 120

# Check the node ready status
# Issue reproduced here
az aks get-credentials -n ${aks} -g ${rG}
kubectl get node

# Check whether stop/start fixes the issue
# Stop AKS
az aks stop -n ${aks} -g ${rG}
# Start AKS
az aks start -n ${aks} -g ${rG}

# Issue persists
kubectl get node

(The health warning appeared roughly 30 minutes after restarting AKS, which is part 2 of the script. Wait about 30 minutes, then check the Azure portal, and make sure the page is fully refreshed with Ctrl+F5.)
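To make the "Issue reproduced here" step easier to read, the NotReady nodes can be filtered out of the `kubectl get node` output. This is a minimal sketch: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout where the second column is STATUS.

```shell
# Print the names of nodes whose STATUS column starts with NotReady
# (this also matches values such as "NotReady,SchedulingDisabled").
# Reads `kubectl get node --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage (assumes kubectl already points at the cluster):
#   kubectl get node --no-headers | not_ready_nodes
```

After the scale-in step in the script above, the deallocated node would be expected to show up in this list.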

Expected behavior

The deallocated nodes should be removed from the AKS node list.

Screenshots

[3 images] "The node could not be found in running/powered on state." [4 images] (Why are there still 5 Pods on a node that is supposedly not started? I don't know.) [1 image]

Environment (please complete the following information):

Additional context

As far as I remember, this issue has existed for a long time. I only started digging into the root cause today.

PixelRobots commented 6 months ago

Hey @pavneeta I spoke to you about this at KubeCon Paris. Seems someone else is also seeing the same. Is this on the roadmap to resolve/tidy up?

abarqawi commented 6 months ago

Hi @JoeyC-Dev, I think this is expected behavior for --scale-down-mode Deallocate. If you want the NotReady nodes to be deleted, you need to use --scale-down-mode Delete.

Reference:

Warning: In order to preserve any deallocated VMs, you must set Scale-down Mode to Deallocate. That includes VMs that have been deallocated using IaaS APIs (Virtual Machine Scale Set APIs). Setting Scale-down Mode to Delete will remove any deallocate VMs. Once applied the deallocated mode and scale down operation occurred, those nodes keep registered in APIserver and appear as NotReady state.

https://learn.microsoft.com/en-us/azure/aks/scale-down-mode#before-you-begin
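Switching between the two modes uses the same `az aks nodepool update` command that appears in the reproduction script. Below is a minimal sketch of a wrapper; the function name `set_scale_down_mode` and the `DRY_RUN` variable are hypothetical, introduced only so the command can be previewed before it is run. Per the warning quoted above, switching to Delete removes any deallocated VMs.

```shell
# Set the scale-down mode of a node pool.
#   $1: mode (Delete or Deallocate)
#   $2: node pool name
#   $3: cluster name
#   $4: resource group
# With DRY_RUN set to a non-empty value, the command is printed instead of executed.
set_scale_down_mode() {
  case "$1" in
    Delete|Deallocate) ;;
    *) echo "unsupported mode: $1" >&2; return 1 ;;
  esac
  ${DRY_RUN:+echo} az aks nodepool update --name "$2" --cluster-name "$3" -g "$4" \
    --scale-down-mode "$1"
}

# Usage, reusing the variables from the reproduction script:
#   DRY_RUN=1 set_scale_down_mode Delete agentpool "${aks}" "${rG}"
#   set_scale_down_mode Delete agentpool "${aks}" "${rG}"
```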

JoeyC-Dev commented 6 months ago

@abarqawi To be really honest, this is sad. It looks like this warning has existed for 3+ years. From my point of view, preserving nodes is still necessary, especially when I need to scale out hundreds of nodes in a very short time and deallocate them when they are not needed. So --scale-down-mode Deallocate is still needed.

I know about "virtual nodes", which can even launch thousands of instances in a short time, but: [image] Is this really acceptable as a workaround? I have my doubts.

The Unhealthy status is discouraging. Maybe it is necessary to keep the nodes registered in the control plane, but at least suppress the "Unhealthy" status in the Azure portal. Otherwise users learn to ignore the yellow bar at the top of the Azure portal and will miss warnings that really matter when they appear in the same place, for example when kubelet is down and the nodes are not up. I have recently seen some users report kubelet-down issues that they noticed because their nodes were not up. If users get used to this "Unhealthy" status, they will eventually ignore a real issue when it comes, because they will assume it is the usual thing.

Maybe some improvements can be done, for real.
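Until something like that exists, one way to keep real problems visible is to subtract the intentionally deallocated instances from the NotReady list. This is only a sketch under stated assumptions: the function name `unexpected_not_ready` and both input files are hypothetical, and it assumes a node's name equals the computer name of its VM scale set instance. How the list of deallocated instances is produced (for example from the scale set's instance view) is left out here.

```shell
# Print NotReady nodes that are NOT in the list of deallocated instances,
# i.e. nodes that are down for a reason other than scale-down deallocation.
#   $1: file with one deallocated instance (computer) name per line
#   $2: file with one NotReady node name per line
# Assumption: node name == VMSS instance computer name.
unexpected_not_ready() {
  awk -v dfile="$1" '
    BEGIN { while ((getline line < dfile) > 0) deallocated[line] = 1 }
    NF && !($1 in deallocated) { print $1 }
  ' "$2"
}

# Usage (file names are placeholders):
#   kubectl get node --no-headers | awk '$2 ~ /^NotReady/ { print $1 }' > notready.txt
#   unexpected_not_ready deallocated.txt notready.txt
```

Any name this prints is a NotReady node that was not deallocated by a scale-down, which is the case that deserves attention.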

microsoft-github-policy-service[bot] commented 5 months ago

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

microsoft-github-policy-service[bot] commented 4 months ago

Issue needing attention of @Azure/aks-leads

chandlerkent commented 2 months ago

I don't know if there is a way to vote on this requested change, but I want to add our voice in support of the frustration of not being able to use Deallocate mode on node pools without putting the AKS cluster in a forever "warning" state.

It also causes constant confusion with Azure support. Whenever we open a case regarding our AKS clusters, Azure support first says our clusters are unhealthy and points to the resource health, which has one or more health events for every day that just say:

Customer Nodes Power Off (Customer Initiated)
At Thursday, September 5, 2024 at 12:35:17 AM EDT, the Azure monitoring system received the following information regarding your Azure Kubernetes Service (AKS):
Multiple nodes could not be found in running/powered on state.

First, this is wrong because we did not initiate any node power-off. It is a built-in feature of AKS!

But it also causes delays in our support resolution as we need to convince Azure support this is not a real issue with our cluster.

xuexu6666 commented 2 months ago

This is the expected behavior. Please see the doc. https://learn.microsoft.com/en-us/azure/aks/scale-down-mode

[image]

chandlerkent commented 2 months ago

> This is the expected behavior. Please see the doc. https://learn.microsoft.com/en-us/azure/aks/scale-down-mode
>
> [image]

I disagree that the documentation you linked describes the downsides of this feature. Please reconsider.

PixelRobots commented 2 months ago

I agree with @chandlerkent. Happy to jump on a call and explain further if needed.

microsoft-github-policy-service[bot] commented 1 month ago

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
