Closed ivelichkovich closed 4 years ago
Looks like even with this, the node goes NotReady and disappears from k8s before I can even manually drain it.
When I was building this, I noticed that sometimes the private IP was there and sometimes it wasn't. I don't think there's a reliable way to get that information.
I labelled the nodes with the instance ID in k8s and used the k8s client to look up the node by instance ID, but even that failed sometimes. Both approaches work every time if the node termination is triggered by the ASG (i.e. scaling the ASG down); otherwise the node is terminated and gone before the Lambda kicks off.
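For reference, the lookup-by-label approach can be sketched roughly like this (the label key and function names are my own placeholders, not taken from the drainer code):

```python
# Hypothetical sketch: find a k8s node by the EC2 instance ID it was
# labelled with. The label key below is a made-up placeholder.
INSTANCE_ID_LABEL = "example.com/instance-id"

def instance_id_selector(instance_id):
    """Build the label selector string used to look the node up."""
    return f"{INSTANCE_ID_LABEL}={instance_id}"

def find_node_by_instance_id(v1_api, instance_id):
    """Return the first node carrying the instance-id label, or None.

    `v1_api` is a kubernetes.client.CoreV1Api instance. This only works
    while the node object still exists in the cluster, which is exactly
    what fails when the node disappears before the Lambda runs.
    """
    nodes = v1_api.list_node(
        label_selector=instance_id_selector(instance_id))
    return nodes.items[0] if nodes.items else None
```

So even with the label in place, the lookup returns nothing once the node object has already been deleted by the time the lifecycle hook fires.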
So, a question for Amazon: why do user-initiated terminations in the console behave differently? Is this intentional? Why do roughly 1 in 20 (not an accurate number) of the user-initiated EC2 terminations work "properly"? I can't imagine mixed behavior is intentional; maybe it's just a race condition, but it seems like when it works, the node hangs around longer. Either way, this should be clearer in the ASG lifecycle hook documentation, unless I missed something.
When you terminate it through the console, I think it uses the bare EC2 API rather than the ASG API. If you look at the example in the docs, you'll see that we specifically use this command to test the drainer:
https://github.com/aws-samples/amazon-k8s-node-drainer#testing-the-drainer-function
```shell
aws autoscaling terminate-instance-in-auto-scaling-group \
    --no-should-decrement-desired-capacity \
    --instance-id <instance-id>
```
Most of the time when this runs, it can't find the k8s node because the k8s node name comes back as "".
I added some logging: `PrivateDnsName` isn't returned from the EC2 describe-instances call, nor is any other private IP information.
Is it possible these instances detach from the ASG before the lifecycle hook completes?