aws-samples / amazon-k8s-node-drainer

Gracefully drain Kubernetes pods from EKS worker nodes during autoscaling scale-in events.

node-drainer isn't aware of LoadBalancer inflight requests causing 502s #31


fincd-aws commented 4 years ago

As pointed out by the comment below from the kubernetes/autoscaler repo, this and other lifecycle hook handlers are naive in assuming that the instance can be terminated as soon as the node's non-DaemonSet pods are gone: kube-proxy may still be proxying connections to pods on other nodes.

There should be a k8s action we can trigger or wait for when the cloud controller deregisters the node from all of the ELBs, before completing the lifecycle hook. The less sophisticated alternative would be to accept a maximum draining wait value that the operator would have to keep updated beyond the ELB default of 300 seconds.

Most drain techniques won't remove daemonset pods, so kube-proxy should continue to run until the instance itself is gone, which in my testing continues to allow it to side-route traffic to pods. There are myriad node draining rigs out there, and it appears "EKS Node Groups" come stock with yet another one. We can use this one, github.com/aws-samples/amazon-k8s-node-drainer to illustrate what we found to be the problem.

https://github.com/aws-samples/amazon-k8s-node-drainer/blob/master/drainer/handler.py#L149

        cordon_node(v1, node_name)

        remove_all_pods(v1, node_name)

        asg.complete_lifecycle_action(LifecycleHookName=lifecycle_hook_name,
                                      AutoScalingGroupName=auto_scaling_group_name,
                                      LifecycleActionResult='CONTINUE',
                                      InstanceId=instance_id)

That is the basic runbook in the handler for the drain. A procedure like this fires once CA has decided the node should be removed, so CA has already evicted most of the pods itself; asg.complete_lifecycle_action will actually destroy the instance. This is a very common and sane thing to do, and is similar to what we had in place, but it will still drop connections during scale-down with Services of type LoadBalancer: CA evicts most/all pods (but not kube-proxy) and eventually terminates the node via the lifecycle hook, which is what lets draining rigs like this one kick into gear. Taking the above code, the draining would look like:

  1. cordon_node is called, which marks the node unschedulable (@Jeffwan's PR will do this earlier, as part of the CA process, which may almost completely solve this issue if it merges). When the node is marked unschedulable, k8s will START to remove it from all ELBs it was added to, honoring any drain settings. Remember that this node may still be actively holding connections EVEN THOUGH nothing but kube-proxy may be running on it, because of how Service type LoadBalancer works. Cordon is non-blocking and returns virtually immediately.
  2. remove_all_pods is called, which should evict all pods that aren't daemonsets. Again, this should leave kube-proxy running and still allow the node to side-route traffic to pods. This will likely run very quickly, or immediately, because CA has likely already evicted the pods before this chain of events starts.
  3. asg.complete_lifecycle_action is called, telling AWS it can actually destroy the instance itself, which stops kube-proxy (obviously) and breaks any connections still being routed through it.

The issue is that it's probably not safe to destroy the node just because all pods have been evicted. cordon_node is non-blocking: it only signals that k8s should start the process of removing the node from the ELBs, but it doesn't wait (and shouldn't) until the node has actually been removed from them. In our case we have a 300s ELB drain configured, so we should wait at least 300 seconds after cordon_node before terminating the node with asg.complete_lifecycle_action.

Our solution was to add logic between remove_all_pods and asg.complete_lifecycle_action. Right now that logic just makes sure we've slept at least as long as our longest ELB drain after calling cordon and before calling asg.complete_lifecycle_action. We plan to replace the arbitrary sleep with an actual check that k8s has removed the instance from all ELBs before calling the lifecycle hook. A nearly arbitrary sleep is, however, the kubernetes way. Most of these drain procedures aren't dealing with the fact that the node may still be handling production traffic when all pods, save for kube-proxy and daemonsets, are gone.

Originally posted by @bshelton229 in https://github.com/kubernetes/autoscaler/issues/1907#issuecomment-561945989
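For reference, a check like the one described above could be built directly on the ELB APIs. The sketch below is a hypothetical helper, not part of this repo: it assumes the operator supplies the relevant target group ARNs, and it polls describe_target_health until the draining instance no longer shows up as a registered target (classic ELBs would need the equivalent elb.describe_instance_health call instead).

```python
import time

import boto3


def wait_for_elb_deregistration(instance_id, target_group_arns,
                                poll_interval=15, timeout=600):
    """Hypothetical helper: block until instance_id is no longer a registered
    target in any of the given target groups, or until timeout expires."""
    elbv2 = boto3.client('elbv2')
    deadline = time.time() + timeout
    while time.time() < deadline:
        still_registered = False
        for arn in target_group_arns:
            resp = elbv2.describe_target_health(
                TargetGroupArn=arn,
                Targets=[{'Id': instance_id}])
            for desc in resp['TargetHealthDescriptions']:
                # 'unused' (Target.NotRegistered) means this target group no
                # longer knows about the instance; 'healthy' or 'draining'
                # means the ELB may still be sending it traffic.
                if desc['TargetHealth']['State'] != 'unused':
                    still_registered = True
        if not still_registered:
            return True
        time.sleep(poll_interval)
    return False  # timed out; the caller decides whether to continue anyway
```

In the drainer this would run between remove_all_pods and asg.complete_lifecycle_action, in place of (or in addition to) a fixed sleep.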

svozza commented 4 years ago

> There should be a k8s action we can trigger or wait for when the cloud controller deregisters the node from all of the ELBs, before completing the lifecycle hook.

I'll be honest, I don't really know what any of that means (I've not used k8s since I published this tool). Can you elaborate?

> The less sophisticated alternative would be to accept a maximum draining wait value that the operator would have to keep updated beyond the ELB default of 300 seconds.

Do we mean some sort of environment variable that the user can configure when installing, telling the drainer to wait for the specified amount of time?

fincd-aws commented 4 years ago

Yes, some user-configurable wait between remove_all_pods() and asg.complete_lifecycle_action().

ELB default draining time is normally 300 seconds, so we can default to that.
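A minimal sketch of what that could look like inside drainer/handler.py, assuming a made-up environment variable name (DRAIN_WAIT_SECONDS) and reusing the handler's existing cordon_node/remove_all_pods helpers:

```python
import os
import time

# Hypothetical setting, not in the current handler: how long to wait after
# eviction so the ELBs can deregister the node and drain in-flight
# connections. Defaults to the ELB connection-draining default of 300 s.
DRAIN_WAIT_SECONDS = int(os.environ.get('DRAIN_WAIT_SECONDS', '300'))


def drain_node(v1, asg, node_name, instance_id,
               lifecycle_hook_name, auto_scaling_group_name):
    cordon_node(v1, node_name)        # k8s starts ELB deregistration here
    remove_all_pods(v1, node_name)    # evicts everything except DaemonSets

    # Give the ELBs time to stop routing through this node before the ASG
    # is allowed to terminate the instance (and kube-proxy with it).
    time.sleep(DRAIN_WAIT_SECONDS)

    asg.complete_lifecycle_action(LifecycleHookName=lifecycle_hook_name,
                                  AutoScalingGroupName=auto_scaling_group_name,
                                  LifecycleActionResult='CONTINUE',
                                  InstanceId=instance_id)
```

Since this runs in a Lambda, its timeout and the lifecycle hook's heartbeat timeout would both need to be at least as long as the configured wait.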

project0 commented 2 years ago

I guess this is caused by https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1719, so it's not a problem particular to the node drainer itself.