aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Managed Node Group Deletion Fails if Node Drain Fails #1636


acegrader33 commented 2 years ago


Tell us about your request
When upgrading an EKS Managed Node Group, there is an option for a "rolling update" or a "force update". During a "rolling update", if a node fails to drain completely, the update fails. Draining can fail because of PodDisruptionBudget configurations, post hooks, or other cluster/pod/deployment settings. Sometimes this is desirable and sometimes it is not, so the "force update" option allows the upgrade to continue despite these drain failures.
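
For reference, the upgrade path already exposes both behaviors through the AWS CLI; a minimal sketch (cluster and node group names are placeholders):

```sh
# Rolling update: fails if a node cannot be drained (e.g. a PodDisruptionBudget blocks eviction)
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup

# Force update: proceeds even if pods cannot be evicted
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --force
```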

However, when deleting a managed node group, there is no similar option. Once a deletion is started, there is no stopping it; essentially the "force" behavior is the only behavior. Much like with upgrades, though, there are situations where we want the deletion to fail if a node fails to drain completely. We request that this option be added for managed node group deletion, and that it be the default behavior, as it is for upgrades.
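
By contrast, the delete call today takes no drain-related options at all (a sketch with the same placeholder names):

```sh
# Deletion has no "fail if the drain fails" equivalent; once started it cannot be stopped
aws eks delete-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup
```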

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When we make changes to EKS Managed Node Groups, such as updating the instance types we are using, the process creates a new managed node group and then deletes the old one. We've had multiple occasions where this caused an outage.

In some cases, draining the old nodes took much longer than the unspecified "few minutes" that EKS waits for the drain to finish, generally because of the PodDisruptionBudgets we have in place to ensure the availability of our applications. In another case, the new managed node group was unable to bring up enough instances for all pods to reschedule, essentially making the cluster unusable until we could undo the change.
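
As a minimal illustration (names are made up), a PodDisruptionBudget like the following allows only one replica to be evicted at a time, so draining a whole node group is throttled by how quickly replacement pods become ready:

```sh
# Example PDB: at most one "my-app" pod may be unavailable at any time, so node
# drains are serialized behind how quickly replacement pods come up and pass readiness
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
EOF
```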

Are you currently working around this issue?
We are avoiding updates that would require us to create and then delete a managed node group, which is the main situation in which we would want the deletion to fail. If we do need to make such a change, we have to:

  1. Create the new node group
  2. Cordon off the old node group
  3. Manually drain all the nodes
  4. Delete the old node group, assuming the drain was successful

This manual process is very undesirable for us, especially in our larger clusters. We might consider going back to managing Auto Scaling groups ourselves and using Lambda functions/lifecycle hooks to ensure this behavior, as we did before adopting managed node groups.
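
A rough sketch of that manual process, assuming the standard eks.amazonaws.com/nodegroup node label and placeholder cluster/node group names:

```sh
# 1) Create the new node group first (eksctl, Terraform, CloudFormation, ...)

# 2) Cordon every node in the old managed node group
kubectl cordon -l eks.amazonaws.com/nodegroup=old-nodegroup

# 3) Drain the old nodes; this respects PodDisruptionBudgets and fails on timeout
kubectl drain -l eks.amazonaws.com/nodegroup=old-nodegroup \
  --ignore-daemonsets --delete-emptydir-data --timeout=60m

# 4) Only if the drain succeeded, delete the old node group
aws eks delete-nodegroup --cluster-name my-cluster --nodegroup-name old-nodegroup
```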

Additional context
Because the managed node group deletion behavior is completely different from the managed node group upgrade behavior, it took us quite a while to track down why we were seeing an outage during these events. I believe this is a bug, since it differs significantly from the behavior of upgrades and does not respect the resiliency and availability settings that Kubernetes lets us use to prevent outages of this type (e.g. PodDisruptionBudgets).


henkka commented 2 years ago

it took us quite a while to track down why we were seeing an outage during these events

Thank you for taking the time to track down the issue and write it up; this description saved our team a lot of time and effort as we were investigating an issue with similar symptoms.

We'll try to raise this issue with our AWS contacts.

siku4 commented 2 years ago

We also ran into this issue and found that node draining is aborted after roughly 7-10 minutes. After that, EKS simply force-deletes the EC2 instances and the node group(s). Pods with a restrictive PDB, or pods that take longer to become ready, cannot move gracefully to the new nodes. And even worse: you cannot configure how many nodes may be drained at the same time, so EKS simply drains and deletes all nodes of a group at once!

Since we use an NLB with IP targets in front of our NGINX ingress controller Pods (the entry points to our services), this behavior has a huge impact on the whole cluster! Even if we configure a loose PDB with maxUnavailable 25%, the time until node draining is aborted is too short for the Pods to move to the new nodes while respecting the required replica count: the new NGINX Pods need about 4-5 minutes to become healthy and receive traffic on the new nodes due to an NLB health check issue (see https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834). So in the worst case we only have 25% of our original replica count up and running on the new nodes before the old nodes are deleted.

Our workaround for the moment, if we cannot avoid changes that lead to node group recreation (our EKS is managed with Terraform), is the following (sketched in the code block after the list):

  1. Delete affected node groups from TF state so that they are not automatically deleted when applying the change
  2. Run terraform apply so that new node groups are created
  3. Set the desired size of the new node groups to the size of the old node groups, to create sufficient capacity for the workload to move
  4. Use eksctl delete nodegroup --name <nodegroup> --cluster <cluster-name> --timeout 60m --parallel 1 to delete the old node groups gracefully. eksctl cordons all nodes at once (scheduling disabled), but unlike the behaviour observed above, it drains only one node at a time (see the --parallel parameter) and with a user-defined timeout (see the --timeout parameter). You can also drain nodes manually via kubectl, but eksctl does this work very well for us.
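
A condensed sketch of those four steps (resource addresses, names, and sizes are placeholders, not our real layout):

```sh
# 1) Remove the old node group from the Terraform state so the plan does not delete it
terraform state rm 'module.eks.aws_eks_node_group.old'

# 2) Create the new node group(s)
terraform apply

# 3) Scale the new node group to the old group's size so the workload has room to move
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name new-nodegroup \
  --scaling-config minSize=3,maxSize=10,desiredSize=6

# 4) Delete the old node group gracefully: one node at a time, with a long drain timeout
eksctl delete nodegroup --name old-nodegroup --cluster my-cluster \
  --timeout 60m --parallel 1
```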

For the future I would like some configuration parameters for managed node group deletions, analogous to the current "Node group update configuration", where I can configure something like maxUnavailable. Furthermore, I'd like to be able to configure a custom timeout for node draining. In short, I don't want manual work or third-party tools for standard maintenance tasks like changing the instance types!
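
For comparison, the existing update configuration is set roughly like this today (AWS CLI sketch with placeholder names); an analogous knob for deletions is what I'm asking for:

```sh
# Existing "node group update configuration": caps how many nodes an upgrade may
# take down in parallel. Nothing comparable exists for node group deletion today.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --update-config maxUnavailable=1
```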

chathsuom commented 2 years ago

We have the exact same issue that @siku4 describes above. With AWS, no matter which solution you choose, something is broken :(

AnhQKatalon commented 1 year ago

We are hitting exactly the same issue that @acegrader33 and @siku4 mentioned.

Firstly, the undefined "few minutes" appears to be 5-7 minutes; I have observed and reproduced this many times. Unfortunately, my two Istio Ingress pods need 10 minutes in total (5 minutes each) to shut down gracefully, so the first Pod is fine, but the second Pod gets force-terminated and causes many 504 errors on our API Gateway.

Secondly, I have already tested all the options that might affect the forced shutdown, such as the Node Termination Handler, stop protection, termination protection, turning off force upgrade, etc., but none of them helped: the node is still force-terminated after about 5 minutes.

So our workaround is the same as @acegrader33's: avoid any upgrade that triggers Terraform's replace behavior, such as changing the instance type. If we really need to do so, we will (see the sketch after the list):

  1. Terraform apply a new module definition to create a new Managed Node Group with exactly the same configuration as the one we intend to change (labels, taints, etc.), except for the properties we need to change
  2. Manually cordon and drain the old nodes, which evicts the two Istio pods
  3. Wait for the two Istio pods to shut down gracefully (about 10 minutes)
  4. Terraform destroy the old Managed Node Group
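
A sketch of that sequence (module addresses and names are illustrative only, not our real layout):

```sh
# 1) Create the replacement node group alongside the old one
terraform apply -target=module.eks_new_nodegroup

# 2) Cordon the old nodes, then drain them; the drain evicts the Istio pods and
#    respects their graceful shutdown instead of EKS's ~5 minute cutoff
kubectl cordon -l eks.amazonaws.com/nodegroup=old-nodegroup
kubectl drain -l eks.amazonaws.com/nodegroup=old-nodegroup \
  --ignore-daemonsets --delete-emptydir-data --timeout=30m

# 3) Once the drain (and the ~10 minutes of Istio shutdown) has finished,
#    destroy the old node group
terraform destroy -target=module.eks_old_nodegroup
```
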
miguelgmcs commented 11 months ago

We are in the same scenario with Strimzi-managed Kafka clusters; this can lead to offline partitions when the replica lag is large enough on the brokers that have already been moved.

We are using the manual approach for the moment, which is not ideal.