acegrader33 opened this issue 2 years ago
> it took us quite a while to track down why we were seeing an outage during these events
Thank you for taking the time to track down the issue and write it up; this issue description saved our team a lot of time and effort as we were investigating an issue with similar symptoms.
We'll also try to raise this issue with our AWS contacts.
We also ran into this issue and found that the node draining is aborted after roughly 7-10 minutes. After that, EKS simply force-deletes the EC2 instances and the node group(s). Pods with a restrictive PDB or a longer time to become ready cannot gracefully move to the new nodes. Even worse: you cannot configure how many nodes may be drained at the same time, so EKS simply drains and deletes all nodes of a group at once!
Since we use an NLB with IP targets in front of our NGINX ingress controller Pods (the entry points to our services), this behavior has a big impact on the whole cluster! Even if we configure a loose PDB with `maxUnavailable` 25%, the window before node draining is aborted is too short for the Pods to move to the new nodes while keeping the required replica count: the new NGINX Pods need ~4-5 minutes to become healthy and receive traffic on the new nodes due to an NLB health check issue (see kubernetes-sigs/aws-load-balancer-controller#1834). So in the worst case we only have 25% of our original replica count up and running on the new nodes before the old nodes are deleted.
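For reference, a loose PDB like the one described can be created imperatively with kubectl; the namespace, name, and label selector below are illustrative placeholders, not our actual manifests:

```sh
# Allow at most 25% of the matching ingress controller Pods to be unavailable
# during voluntary disruptions such as node drains (names/labels are placeholders).
kubectl -n ingress-nginx create poddisruptionbudget nginx-ingress-pdb \
  --selector='app.kubernetes.io/name=ingress-nginx' \
  --max-unavailable=25%
```

Even with such a budget in place, the drain is abandoned once the undocumented ~7-10 minute timeout expires and the remaining old nodes are force-deleted anyway.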
Our workaround for the moment - if we cannot avoid changes that lead to node group recreation - is the following (our EKS is managed with Terraform; a command sketch follows the list):
- Delete the affected node groups from the Terraform state so that they are not automatically deleted when applying the change
- Run `terraform apply` so that the new node groups are created
- Set the desired size of the new node groups to the size of the "old" node groups to create sufficient space for workload movement
- Use `eksctl delete nodegroup --name <nodegroup> --cluster <cluster-name> --timeout 60m --parallel 1` to delete the old node groups gracefully!
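Put together as shell commands, the procedure looks roughly like this; the Terraform resource address and the cluster/node group names are placeholders that depend on your module layout:

```sh
# 1. Forget the old node group in Terraform so the apply does not destroy it
#    (the resource address is a placeholder for your actual module/resource path).
terraform state rm 'aws_eks_node_group.old'

# 2. Create the replacement node group(s) and give them enough capacity
#    (desired size of the new groups = size of the old groups).
terraform apply

# 3. Delete the old node group gracefully: one node at a time, generous timeout.
eksctl delete nodegroup \
  --cluster <cluster-name> \
  --name <old-nodegroup> \
  --timeout 60m \
  --parallel 1
```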
`eksctl` cordons all the nodes at once (scheduling disabled), but unlike the previously observed behaviour, draining then takes place only one node at a time (see the `--parallel` parameter) and with a user-defined timeout (see the `--timeout` parameter)! You can also drain nodes manually via `kubectl`, but `eksctl` does this work very well for us.
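The manual `kubectl` variant for a single node looks roughly like this (the node name is a placeholder):

```sh
# Stop new Pods from being scheduled onto the node, then evict its Pods
# while respecting PodDisruptionBudgets.
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=60m
```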
For the future I would like configuration parameters for managed node group deletions, analogous to the current "Node group update configuration", where I can configure something like `maxUnavailable`. Furthermore, I'd like to be able to configure a custom timeout for node draining. Overall, I don't want manual work or third-party tools for standard maintenance tasks like changing the instance types!
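For comparison, the update path already has such a setting today; a call roughly like the following (cluster and node group names are placeholders) sets `maxUnavailablePercentage` for rolling updates, and an equivalent knob for deletions is exactly what is missing:

```sh
# Existing "Node group update configuration": limits how many nodes a rolling
# *update* may disrupt at once. No equivalent exists for node group *deletion*.
aws eks update-nodegroup-config \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --update-config maxUnavailablePercentage=25
```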
We have the exact same issue. With AWS, no matter what solution you choose, there is one thing broken :(
We are running into exactly the same issue as @acegrader33 and @siku4 mentioned. We have a node group labeled `role=ingress` that runs two pods of Istio Ingress, and each of them needs 5 minutes for graceful shutdown. When a change triggers the `replace` behavior that recreates the Node Group:
Firstly, the undefined "few minutes" is actually 5-7 minutes, as I have watched and reproduced many times. Unluckily, my two Istio Ingress pods need a total of 10 minutes (5 minutes x 2) for graceful shutdown, so the 1st Pod is fine, but the 2nd Pod gets force-terminated and causes many 504 errors on our API Gateway.
Secondly, I have already tested all the measures that might affect the forced shutdown, like the Node Termination Handler, stop protection, termination protection, turning off force upgrade, etc. None of them helped; the Node is still force-terminated after ~5 minutes.
So our workaround is also the same as @acegrader33's: avoiding any changes that would trigger Terraform's `replace` behavior, like changing the instance type. If we really need to do so, we will:
We are in the same scenario with Strimzi-managed Kafka clusters; this might lead to offline partitions when the replica lag is big enough on the already-moved brokers.
We are using the manual approach for the moment, which is not ideal.
**Tell us about your request**
When upgrading an EKS Managed Node Group, there is an option for a "rolling update" or a "force update". During a "rolling update", if a node fails to drain completely, the update fails. Draining might fail because of PodDisruptionBudget configurations, post hooks, or other cluster/pod/deployment settings. Sometimes this is desirable, sometimes it is not, and so the "force update" option allows the upgrade to continue despite these drain failures.
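For context, the two update modes map to the presence or absence of the force flag on the update call, roughly as follows (cluster and node group names are placeholders):

```sh
# Rolling update (default): fails if a node cannot be fully drained,
# e.g. because evictions would violate a PodDisruptionBudget.
aws eks update-nodegroup-version \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name>

# Force update: proceeds and terminates nodes even if draining fails.
aws eks update-nodegroup-version \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --force
```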
However, when deleting a managed node group, there is no similar option. Once a deletion is started, there is no stopping it (essentially the "force" behavior is the only behavior). Much like upgrades though, there are situations where we want the deletion to fail if a node fails to drain completely. We request that this option be added for managed node group deletion, and for it to be the default behavior similar to upgrades.
**Which service(s) is this request for?**
EKS
**Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?**
When we make changes to EKS Managed Node Groups such as updating the instance types we are using, the process creates a new managed node group and then deletes the old managed node group. We've had multiple occasions where this causes an outage.
In some cases, draining the old nodes took much longer than the undefined "few minutes" which EKS waits for the drain to finish, generally because of PodDisruptionBudgets which we have in place to ensure availability of our applications. In another case, the new managed node group was unable to bring up sufficient instances to allow all pods to reschedule, essentially making the cluster unusable until we could undo the change.
**Are you currently working around this issue?**
We are avoiding updates that would require us to create and then delete a managed node group, which is the main situation in which we would want the deletion to fail. If we need to do this though, we will have to 1) create the new node group, 2) cordon off the old node group, 3) manually drain all the nodes, 4) delete the old node group assuming the drain was successful. This manual process is very undesirable for us, especially in our larger clusters. We might consider going back to managing autoscaling groups ourselves, and use lambda functions/lifecycle hooks to ensure this behavior as we did before adopting managed node groups.
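Sketched as commands, that manual process looks something like this, assuming the `eks.amazonaws.com/nodegroup` label that EKS puts on managed nodes; cluster and node group names are placeholders:

```sh
# 1. Create the new node group first (console, IaC, or eksctl - whatever manages it).

# 2./3. Cordon and then drain every node of the old managed node group,
#       one at a time, so PodDisruptionBudgets are respected.
OLD_NODES=$(kubectl get nodes -l 'eks.amazonaws.com/nodegroup=<old-nodegroup>' \
  -o jsonpath='{.items[*].metadata.name}')
for node in $OLD_NODES; do
  kubectl cordon "$node"
done
for node in $OLD_NODES; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# 4. Only after all drains succeeded, delete the old node group.
aws eks delete-nodegroup --cluster-name <cluster-name> --nodegroup-name <old-nodegroup>
```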
**Additional context**
Because the managed node group deletion behavior is totally different from the managed node group upgrade behavior, it took us quite a while to track down why we were seeing an outage during these events. I believe this is a bug, since it differs significantly from the behavior of upgrades and does not respect the resiliency and availability settings that Kubernetes allows us to use to prevent outages of this type (i.e. PodDisruptionBudgets).