Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/
1.96k stars 306 forks source link

[BUG] ClusterAutoscaler blocked on unhealthy nodes, caused by Spot preempted nodes were not removed from API server #4420

Open grzesuav opened 3 months ago

grzesuav commented 3 months ago

Describe the bug Cluster autoscaler was blocked as cluster state was unhealthy. Cluster state was unhealthy as more than 30% of nodes were unhealthy. All of the unhealthy nodes were just dummy entries in API server, they were preempted spot instances. From some reasons objects in API server were not deleted.

To Reproduce N/A

Expected behavior Spot node is removed from API server after preemption

Screenshots N/A

Environment (please complete the following information):

Additional context It happened first time for me, in eastus2 region

kevinkrp93 commented 2 months ago

The bug fix for this has been rolled out

grzesuav commented 2 months ago

@kevinkrp93 can you elaborate a bit in which version is it fixed ? And share some details how it was fixed actually?

grzesuav commented 1 month ago

team, can we get details in how the fix was delivered to confirm the issue is actually in our clusters ?

kevinkrp93 commented 1 month ago

@grzesuav - It was a bug in compute which we fixed. Are you still facing this issue?