stevehipwell opened this issue 2 years ago (status: Open)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
@bwagner5 this is probably the biggest pain point with v1 and unmanaged node groups.
Has this functionality been considered for v2.0.0 or v2.x? It's still our biggest pain point with EKS: when we update our nodes, the cluster often breaks.
@cjerad I've had a quick look at the v2 code (see https://github.com/aws/aws-node-termination-handler/pull/707#issuecomment-1283704197) and it looks to me like the only thing missing from the new terminator concept to solve this issue is the ability to specify a concurrency value per terminator. I'd be happy to offer this functionality as a PR, or as a more detailed spec to be implemented.
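For illustration only, the kind of terminator field I have in mind would look roughly like the Go sketch below; the package, type, and field names here are hypothetical, not the actual v2 API.

```go
// Hypothetical sketch only; the real v2 Terminator spec may differ.
package v2sketch

// TerminatorSpec shows where a per-terminator concurrency limit could sit
// alongside whatever node selection the terminator already does.
type TerminatorSpec struct {
	// MatchLabels selects the nodes this terminator is responsible for.
	MatchLabels map[string]string `json:"matchLabels,omitempty"`

	// Concurrency caps how many selected nodes may be cordoned and drained
	// at the same time; zero or unset would mean no limit.
	Concurrency int `json:"concurrency,omitempty"`
}
```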
CC @bwagner5
I'd like NTH to be able to group nodes (similar to the CA --balance-similar-node-groups) and support processing n nodes per group (this can still be constrained by the workers configuration).
When using NTH to manage ASG instance refresh events it is very easy to get a cluster into a blocking race condition: pods are evicted from multiple nodes at once, and PDBs then prevent any of those nodes from fully shutting down.
So pods are not terminated because that would violate the PDB, but there are no available nodes on which to launch new pods. Do I have that right? If so, how would the ability to "group nodes" and then "processing n nodes per group" address the situation? NTH cannot prevent an EC2 instance from being terminated and it cannot launch new EC2 instances. But maybe I have misunderstood something in your proposal.
@cjerad my specific scenario is that we update the launch templates and the ASGs' instance refresh triggers to replace the outdated nodes. This fires a lifecycle hook event for each node being replaced and launches new instances. NTH receives the events for all of those nodes and can drain them before allowing the lifecycle action to complete. The problem is that when using ASGs for stateful workloads you need a separate ASG in each AZ, so in the above scenario at least 3 nodes are sent to NTH at once. If you have workloads spread across these 3 nodes with PDBs requiring at least 1 running pod, and all 3 nodes get actioned by NTH at the same time, you end up with a race condition: all nodes are cordoned, pods have been evicted from all of them but have nowhere to go, and every node is stuck terminating until the new compute is available (in v1, where spot also needs handling, this results in a force termination).
The simple fix is to terminate the nodes with a concurrency of 1 while sending a heartbeat back for the lifecycle hook (or having a long enough timeout); this would allow each node to be cleanly drained and terminated before moving on to the next one. The simplest implementation would be to add a concurrency field to the terminator and use it for all termination types configured; this would be a great MVP solution. An improvement on top of the MVP would be to add a concurrencyLabelSelector field which, if provided, groups the nodes into buckets before applying the concurrency rules; this would allow a single terminator to handle instance refresh across all ASGs.
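As a very rough illustration of the grouping idea, the sketch below is plain Go with no real NTH or Kubernetes types; the label key, concurrency value, and node names are all made up.

```go
package main

import "fmt"

// node is a minimal stand-in for a Kubernetes node in this sketch.
type node struct {
	name   string
	labels map[string]string
}

// groupByLabel buckets nodes by the value of the given label key;
// nodes without the label all end up in one shared bucket.
func groupByLabel(nodes []node, labelKey string) map[string][]node {
	groups := map[string][]node{}
	for _, n := range nodes {
		groups[n.labels[labelKey]] = append(groups[n.labels[labelKey]], n)
	}
	return groups
}

func main() {
	nodes := []node{
		{"system-az-a-1", map[string]string{"node-group": "system"}},
		{"system-az-b-1", map[string]string{"node-group": "system"}},
		{"worker-az-a-1", map[string]string{"node-group": "workers"}},
	}
	concurrency := 1 // hypothetical per-terminator concurrency value
	for group, members := range groupByLabel(nodes, "node-group") {
		// Only action up to `concurrency` nodes per group at once; the rest
		// wait (with lifecycle heartbeats) until earlier drains finish.
		batch := members
		if len(batch) > concurrency {
			batch = batch[:concurrency]
		}
		fmt.Printf("group %q: draining %d of %d nodes now\n", group, len(batch), len(members))
	}
}
```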
The launch template update should trigger an instance refresh event. Would the ASG's minimum healthy percentage [1] setting be helpful here? If you've already tried that but still saw issues, what behavior did you see?
@cjerad there are multiple ASGs per logical "node group", which is best practice (and the only functional solution) for stateful workloads, so something in the architecture needs to treat them as a single group when updating in order not to trigger the race condition.
@stevehipwell Tell me if I am understanding your situation correctly:
Are there any inaccuracies or gaps in my description? If so, will you correct them?
How many nodes are in each of these ASGs? If each ASG only has one node, then this availability concern is to be expected, per the ASG documentation:
Instances terminated before launch: When there is only one instance in the Auto Scaling group, starting an instance refresh can result in an outage. This is because Amazon EC2 Auto Scaling terminates an instance and then launches a new instance.
If that's the case, then I could understand the need to address this availability problem at a higher level than the ASG, as you are proposing.
However, if each ASG has more than one node, you may be able to set a higher minimum healthy percentage (perhaps even as high as 100%) to cause the instance refresh to proceed at a slower pace within each ASG. If so, would that alleviate the capacity constraints that are causing this issue?
Setting the minimum healthy percentage to 100 percent limits the rate of replacement to one instance at a time.
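For reference, that preference is set when the instance refresh is started; a rough sketch with aws-sdk-go-v2 (the ASG name and warmup value are just examples) would look like this:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := autoscaling.NewFromConfig(cfg)

	// Keep the group at full healthy capacity during the refresh, so
	// replacements effectively proceed one instance at a time per ASG.
	_, err = client.StartInstanceRefresh(context.TODO(), &autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String("example-asg"), // example name
		Preferences: &types.RefreshPreferences{
			MinHealthyPercentage: aws.Int32(100),
			InstanceWarmup:       aws.Int32(300), // seconds; illustrative
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```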
This seems like a problem better solved closer to the infrastructure level if possible, e.g., in the individual ASGs, so we are trying to understand what your situation is and what you want to accomplish. NTH intentionally does not know or care about ASGs or other conceptual node groupings at this time, and adding such awareness would be a large change that we'd need to consider carefully.
@snay2 I think you've understood the problem pretty well.
I think there are two significant parts of our architecture which make this issue show up more clearly than in some other use cases. Firstly, our architecture is built around the principle that clusters will have nodes well balanced across AZs, via an ASG per AZ per logical group, and pods are expected to be multi-replica and distributed across AZs (bring on EKS v1.24 so we get topology aware hints). Secondly, our clusters all run a set of "system" nodes, which is often a single node per AZ (same ASG-per-AZ architecture) and runs stateful workloads.
I agree that an ASG-native solution would be best, but scale to 0 for the ASGs behind MNGs hasn't been completed years after it was first requested, so I don't think that would be realistic (see https://github.com/aws/containers-roadmap/issues/1865 & https://github.com/aws/containers-roadmap/issues/1866 for related requests for MNGs). I think this is very much an NTH problem, especially in its v2 spec, as the solution wouldn't need to be ASG based; it could be based on labels only (which in our case would be seeded by the ASGs).
This is a major issue for us at the moment and there isn't a solution, so we need a fix, and of the existing components we run on our EKS clusters I think NTH is the best fit. Alternatively we could write our own controller, but that seems like quite a lot of duplication.
Describe the feature
I'd like NTH to be able to group nodes (similar to the CA --balance-similar-node-groups) and support processing n nodes per group (this can still be constrained by the workers configuration). I assume that v2 would be designed around this kind of concept, but I think it'd be worth doing in v1 assuming it wouldn't take too much effort.
Is the feature request related to a problem?
When using NTH to manage ASG instance refresh events it is very easy to get a cluster into a blocking race condition: pods are evicted from multiple nodes at once, and PDBs then prevent any of those nodes from fully shutting down. This results in hard terminations and general cluster instability.
Describe alternatives you've considered
Using a single worker would work, but it would be too slow to respond to time-critical events, and even for instance refresh it could be too slow for good usability.