aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [bug]: auto-scaling group ends up in a bad state after `kubectl delete node` #1811

Open Aleksei-Poliakov opened 2 years ago

Aleksei-Poliakov commented 2 years ago

Tell us about your request

Currently, whenever `kubectl delete node` is run in the cluster, the node is removed from k8s but the EC2 instance behind the node is not terminated. As a result, the AWS auto-scaling group behind the k8s node group does not create new EC2 instances, which also breaks things like cluster auto-scaler.

An example would look like this:

The only way I know to resolve the situation is to MANUALLY find the EC2 instance that is no longer mapped to a node in the k8s cluster and terminate it. The ASG then picks this up and continues handling auto-scaler requests.
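For anyone who wants to script the "find the unmapped instance" step, here is a minimal sketch in Python. It is not an official tool, and the function names and instance IDs are made up for illustration. It assumes you have already collected two lists yourself: the instance IDs that belong to the ASG (for example from `aws autoscaling describe-auto-scaling-groups`) and the `spec.providerID` values of the nodes that k8s still knows about (for example from `kubectl get nodes -o jsonpath='{.items[*].spec.providerID}'`). It also assumes the usual AWS providerID shape, `aws:///<availability-zone>/<instance-id>`.

```python
from typing import Iterable, List, Optional


def instance_id_from_provider_id(provider_id: str) -> Optional[str]:
    """Extract the EC2 instance ID from a node's spec.providerID.

    Assumes the usual AWS form 'aws:///us-west-2a/i-0123456789abcdef0'.
    Returns None for anything that does not look like that.
    """
    if not provider_id.startswith("aws://"):
        return None
    candidate = provider_id.rsplit("/", 1)[-1]
    return candidate if candidate.startswith("i-") else None


def find_orphaned_instances(
    asg_instance_ids: Iterable[str],
    node_provider_ids: Iterable[str],
) -> List[str]:
    """Return ASG instance IDs that have no matching node in the cluster."""
    registered = set()
    for provider_id in node_provider_ids:
        instance_id = instance_id_from_provider_id(provider_id)
        if instance_id is not None:
            registered.add(instance_id)
    # Anything in the ASG that no node points at is a termination candidate.
    return sorted(i for i in asg_instance_ids if i not in registered)


if __name__ == "__main__":
    # Hypothetical data: the node backed by i-0bbb was removed with
    # `kubectl delete node`, so only two nodes remain in the cluster.
    asg_instances = ["i-0aaa", "i-0bbb", "i-0ccc"]
    nodes = ["aws:///us-west-2a/i-0aaa", "aws:///us-west-2b/i-0ccc"]
    print(find_orphaned_instances(asg_instances, nodes))
```

Treat the result as a list of candidates only: an instance that has just launched and has not yet joined the cluster would show up here too, so check each one before terminating it.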

Which service(s) is this request for? EKS, ASG

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? There is no particular need to use `kubectl delete node`, but having this behavior in the system is very dangerous. I ended up in this situation because I wanted to get rid of nodes that seemed to be poisoned (pods running on them were performing worse than pods of the same service running on all other nodes in the cluster). It turned out the issue was totally unrelated, but by running `kubectl delete node` I put the cluster into a bad state that took a fair amount of effort to get to the bottom of.

Are you currently working around this issue? Yes, manually terminating the EC2 instance is a viable workaround.

Additional context

You can see more details in:

nalshamaajc commented 1 year ago

Should we expect Managed Node Groups to be able to figure out that difference and cycle (terminate and create a new one) the deleted nodes?

Aleksei-Poliakov commented 1 year ago

I believe yes, the deleted node should be terminated in this scenario. To be clear, there are already ways in the ecosystem to safely remove an EC2 instance from the cluster by cordoning the node and then detaching it from the ASG. So if the user explicitly asks to delete a node, it seems totally reasonable that the EC2 instance behind it is also terminated and, most importantly, that the node group itself remains "healthy" (e.g. does not prevent scaling up).

nalshamaajc commented 1 year ago

Yes, it should. My question was more about AWS adding this feature, which I think could be optionally enabled.

wonko commented 12 months ago

Hitting the same issue here.

For completeness, a `kubectl delete node xxx` on either GCP or Azure will actually terminate the backing VM as well, allowing for complete node management from within Kubernetes.

davidr-bt commented 1 month ago

In our case, the culprit was the ASG setting Desired = 1, which is where we saw the unwelcome behavior described above. It appears we did not see this behavior with Desired = 0.