Closed ivelichkovich closed 4 years ago
Looks like even with this, the node goes NotReady and disappears from k8s before I can even manually drain it.
When I was building this, I noticed that sometimes the private IP was there and sometimes it wasn't. I don't think there's a reliable way to get that information.
I labelled the nodes with the instance ID in k8s and used the k8s client to look up the node by instance ID, but even that failed sometimes. Both approaches work every time if the node termination is triggered by the ASG (i.e. scaling the ASG down); otherwise the node is terminated and gone before the Lambda kicks off.
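For reference, the lookup-by-label approach can be sketched roughly like this (the label key and function names are my own placeholders, not taken from the drainer code):

```python
# Hypothetical sketch: find a k8s node by the EC2 instance ID it was
# labelled with. The label key below is a made-up placeholder.
INSTANCE_ID_LABEL = "example.com/instance-id"

def instance_id_selector(instance_id):
    """Build the label selector string used to look the node up."""
    return f"{INSTANCE_ID_LABEL}={instance_id}"

def find_node_by_instance_id(v1_api, instance_id):
    """Return the first node carrying the instance-id label, or None.

    `v1_api` is a kubernetes.client.CoreV1Api instance. This only works
    while the node object still exists in the cluster, which is exactly
    what fails when the node disappears before the Lambda runs.
    """
    nodes = v1_api.list_node(
        label_selector=instance_id_selector(instance_id))
    return nodes.items[0] if nodes.items else None
```

So even with the label in place, the lookup returns nothing once the node object has already been deleted by the time the lifecycle hook fires.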
So, a question for Amazon: why do user-initiated terminations in the console behave differently? Is this intentional? Why do roughly 1 in 20 (not an accurate number) of the user-initiated EC2 terminations work "properly"? I can't imagine mixed behavior is intentional; maybe it's just a race condition, but it seems like when it works, the node hangs around longer. Either way, this should be clearer in the ASG lifecycle hook documentation, unless I missed something.
When you terminate it through the console, I think it uses the bare EC2 API rather than the ASG API. If you look at the example in the docs, you'll see that we specifically use this command to test the drainer:
https://github.com/aws-samples/amazon-k8s-node-drainer#testing-the-drainer-function
```shell
aws autoscaling terminate-instance-in-auto-scaling-group \
    --no-should-decrement-desired-capacity \
    --instance-id <instance-id>
```
Most of the time when this runs, it can't find the k8s node because the k8s node name comes back as "".
I added some logging: `PrivateDnsName` isn't returned from the EC2 describe-instances call, nor is any other private IP information.
Is it possible these instances detach from the ASG before the lifecycle hook completes?