JoeyC-Dev opened 6 months ago
Hey @pavneeta, I spoke to you about this at KubeCon Paris. It seems someone else is seeing the same thing. Is resolving/tidying this up on the roadmap?
Hi @JoeyC-Dev, I think this is expected behavior for --scale-down-mode Deallocate. If you want the NotReady nodes to be deleted, you need to use --scale-down-mode Delete.
Reference: "Warning: In order to preserve any deallocated VMs, you must set Scale-down Mode to Deallocate. That includes VMs that have been deallocated using IaaS APIs (Virtual Machine Scale Set APIs). Setting Scale-down Mode to Delete will remove any deallocated VMs." Once Deallocate mode is applied and a scale-down operation occurs, those nodes remain registered with the API server and appear in the NotReady state.
https://learn.microsoft.com/en-us/azure/aks/scale-down-mode#before-you-begin
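For reference, switching a node pool between the two modes is a single Azure CLI call. A minimal sketch; the resource group, cluster, and node pool names below are placeholders:

```shell
# Preserve deallocated VMs on scale-in (nodes will stay registered
# with the API server and show as NotReady):
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --scale-down-mode Deallocate

# Delete VMs on scale-in instead, so no NotReady nodes linger:
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --scale-down-mode Delete
```

Note that switching to Delete will remove any VMs that are already deallocated, per the warning quoted above.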
@abarqawi To be really honest, this is sad. It looks like this warning has existed for 3+ years.
Preserving nodes is still necessary from my point of view: especially when I need to scale out hundreds of nodes in a very short time and deallocate them when they are not needed. So --scale-down-mode Deallocate is still needed.
I know about "virtual nodes", which can even launch thousands of instances in a short time, but is that really acceptable as a workaround? I have my doubts.
The Unhealthy status is discouraging. Maybe it is necessary to keep the nodes in the control plane, but at least suppress the "Unhealthy" warning in the Azure portal. It trains users to keep ignoring the yellow bar at the top of the portal, so they will not notice when a real warning appears in the same place.
For example: kubelet goes down, and the nodes fail to come up. I have seen some users report kubelet-down issues recently because they noticed their nodes were not up. If users get used to this "Unhealthy" noise, they will eventually ignore a real issue when it comes, assuming it is the usual thing.
Some improvements really could be made here.
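One way to tell an intentionally deallocated node from a genuinely failed one today is to check the power state on the VMSS side rather than trusting the portal banner. A sketch only; the node resource group (MC_...) and VMSS names below are hypothetical placeholders and depend on your cluster:

```shell
# NotReady nodes as seen by the API server
# (deallocated and genuinely broken nodes look alike here):
kubectl get nodes --no-headers | grep -w NotReady

# Power state of the underlying VMSS instances; "VM deallocated" is
# expected under --scale-down-mode Deallocate, anything else (e.g. a
# running VM whose node is still NotReady) deserves a real look:
az vmss list-instances \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-nodepool1-12345678-vmss \
  --expand instanceView \
  --query "[].{name:name, power:instanceView.statuses[?starts_with(code,'PowerState')].displayStatus | [0]}" \
  -o table
```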
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Issue needing attention of @Azure/aks-leads
I don't know if there is a way to vote on this requested change, but I want to add our voice: it is frustrating not to be able to use Deallocate mode on node pools without putting the AKS cluster in a permanent "warning" state.
It also causes constant confusion with Azure support. Whenever we open a case regarding our AKS clusters, Azure support first says our clusters are unhealthy and points to the resource health, which has one or more health events for every day that just say:
Customer Nodes Power Off (Customer Initiated)
At Thursday, September 5, 2024 at 12:35:17 AM EDT, the Azure monitoring system received the following information regarding your Azure Kubernetes Service (AKS):
Multiple nodes could not be found in running/powered on state.
This is wrong, first of all, because we did not initiate any node power-off: it is a built-in feature of AKS!
But it also causes delays in our support resolution as we need to convince Azure support this is not a real issue with our cluster.
This is the expected behavior. Please see the doc. https://learn.microsoft.com/en-us/azure/aks/scale-down-mode
I disagree that the documentation you linked describes the downsides of this feature. Please reconsider.
I agree with @chandlerkent. Happy to jump on a call and explain further if needed.
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Issue needing attention of @Azure/aks-leads
Describe the bug
When a node pool's --scale-down-mode is set to Deallocate and the pool is then scaled in by one node, it causes an Unhealthy report in the Azure portal.

To Reproduce
(The health warning popped up roughly 30 minutes after restarting the AKS cluster (part 2 of the script); simply wait and check the Azure portal in 30 minutes. Make sure the page is fully refreshed by using Ctrl+F5.)

Expected behavior
The deallocated nodes should be removed from the AKS node list.

Screenshots
"The node could not be found in running/powered on state." (Why are there still 5 Pods on a node that is supposedly not started? I don't know.)

Environment (please complete the following information):

Additional context
As far as I remember, this issue has existed for a long time; I only started digging into the root cause today.