infa-ddeore opened this issue 1 year ago
Which ELB are you referencing? How long of a delay do you need? Karpenter currently waits until the node is fully drained, terminates the instance, then removes the node. There's a period of time where the instance can still be shutting down when the node is deleted. Would it be sufficient to wait for the instance to be fully terminated before deleting the node?
I think this should be solved at the pod level using terminationGracePeriodSeconds and a lifecycle preStop hook. For us it works perfectly; here is an example: https://github.com/arvatoaws-labs/wordpress-bedrock-helm-chart/blob/master/templates/deployment-worker.yaml
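A minimal sketch of that pattern (illustrative only, not taken from the linked chart; the name, image, and timings are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web   # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # give in-flight requests time to finish after the endpoint is removed
      terminationGracePeriodSeconds: 90
      containers:
        - name: app
          image: nginx:1.25   # placeholder image
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                # keep serving while the LB / endpoints controller catches up
                command: ["sleep", "60"]
```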
If implemented, https://github.com/kubernetes/enhancements/issues/4212 (an API for declarative node maintenance / drains) might help with this.
Adding a delay in pod shutdown solves the problem for a node where the pod is running, but in a use case where externalTrafficPolicy is cluster, all EC2 instances from the cluster are added to the AWS ELB.
If a node becomes empty in this case for whatever reason (e.g. Spark jobs are completed), that node will be deleted immediately; if that node is serving traffic (via kube-proxy), then there will be client-side errors.
We've been thinking about driving termination of nodes through adding a NoExecute taint on the node so that all DS pods and other non-drainable pods can clean up before the node is deleted. Would this be sufficient? Rather than add in a per-node delay, I'd rather drive this by when the pods themselves say everything has been cleaned up.
Will the NoExecute drain remove the kube-proxy DaemonSet pod as well? If yes, this action will still be disruptive if the node is proxying a long-running request from the ELB when the connection draining setting is enabled with some interval.
NoExecute will respect the pod's tolerationGracePeriodSeconds. You'd be able to use that to drive how long you tolerate scale down. I'm going to close this in favor of https://github.com/aws/karpenter-core/issues/621
@njtran I'm not sure why this issue was closed in favor of the other issue, as the other issue does not address this specific use case.
When using AWS NLB in instance mode with externalTrafficPolicy=cluster, a node will still receive traffic from the NLB during deregistration. This normally takes between 180s and 200s.
Since the traffic policy is cluster, the node doesn't need to host the target pod, as kube-proxy will route the traffic to another node where the pod exists.
The issue is that when the node reaches a state at which Karpenter determines it can destroy it, the deregistration delay hasn't elapsed on the NLB, and the node is still likely handling connections and forwarding them to other nodes. Karpenter terminating the node breaks these connections and results in scores of 5xx errors.
This is a well-known issue and was resolved with cluster-autoscaler by adding a termination-wait lifecycle hook to the ASG, so that a termination signal to the ASG started a timer and only then destroyed the node.
Being able to tell Karpenter to wait 'n' number of minutes after it knows it can delete a node before actually doing so would resolve this issue.
This is preferable to handling this with terminationGracePeriodSeconds and preStop, as that significantly impacts rollout timing on large deployments. Telling pods to wait 200s makes in-cluster rolling restarts terrible. With the Karpenter approach, your deployment can roll fairly quickly, maintaining all connections, and the node is just an empty node waiting to be removed.
Can we consider reopening this?
EDIT: This appears to have existed in versions prior to 0.32.x: https://github.com/aws/karpenter-provider-aws/blob/release-v0.31.x/pkg/apis/crds/karpenter.sh_provisioners.yaml#L294-L300
This ttlSecondsAfterEmpty appears, at first glance, to be the behavior being requested. If I'm reading that right, Karpenter would drain the node to the point where only DaemonSets are present, then the ttlSecondsAfterEmpty counter would start, and once it completed, Karpenter would send the final termination signal to AWS to destroy the node.
I'm not sure why it was removed, but if my assumption about its behavior is correct, that seems to be exactly what would resolve this issue.
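For reference, this is roughly how that field appeared in the pre-v0.32 Provisioner API (a sketch; names are placeholders, and whether it actually delays Karpenter-initiated deletion is exactly what's questioned below):

```yaml
# karpenter.sh/v1alpha5 API, removed in v0.32+
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # seconds to wait after the node is empty (only daemonsets left) before termination
  ttlSecondsAfterEmpty: 300
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  providerRef:
    name: default   # references an AWSNodeTemplate
```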
Well explained. With cluster-autoscaler we added a customization that removes the node from the LB; the ASG has a lifecycle hook that waits for some time after cluster-autoscaler deletes the node, so that the node can keep responding to in-flight / long-running connections.
@awiesner4 as per our testing, ttlSecondsAfterEmpty isn't what we are looking for. This setting actually waits for the configured time if a node has become empty outside of Karpenter (e.g. kubectl drain); it doesn't add a delay to the node deletion process triggered by Karpenter. So there needs to be a new feature to support adding a delay between node cordon and deletion time.
Thank you, @infa-ddeore.
That actually makes sense based on the replacement functionality in NodePools of disruption.consolidationPolicy and disruption.consolidateAfter.
So the basic problem here is (and I still need to test this to prove it out) that we can theoretically set disruption.consolidationPolicy to WhenEmpty and then disruption.consolidateAfter to something like 300s, and it should handle most use cases as, from what I can tell, consolidateAfter only takes effect when disruption.consolidationPolicy is WhenEmpty.
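A sketch of those NodePool settings, assuming the v1beta1 API that replaced the Provisioner in v0.32 (names and requirements are placeholders):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: example
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # only consider nodes that are empty
    consolidateAfter: 300s           # how long a node must stay empty before consolidation
  template:
    spec:
      nodeClassRef:
        name: default                # placeholder EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```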
EDIT: The main problem I see here is a bad assumption that drained pod connections are equal to node connections from the NLB being drained. That's just not true in this use case.
Hey @awiesner4, sure I can re-open.
This was closed in favor of https://github.com/kubernetes-sigs/karpenter/issues/621 as this controls and ensures that all pods on a node are cleaned up prior to executing node termination. The disruption settings you're seeing happen way before this process. I'd recommend reading more about this in the docs, as emptiness that you're referring to in v0.31.1 is not the same here.
Is there a signal that Karpenter can consume to know when it's waited long enough? Does seeing 5xx errors impact performance on your other NLBs?
This was closed in favor of https://github.com/kubernetes-sigs/karpenter/issues/621 as this controls and ensures that all pods on a node are cleaned up prior to executing node termination.
Yeah, I think that's where the disconnect is as it relates to this issue.
Pod cleanup (i.e., connection draining) from a node is not equal to connection cleanup for that node when using an AWS NLB in instance mode with externalTrafficPolicy: Cluster on your K8s Service.
To illustrate this, let's say you have a 2 node cluster with NodeA and NodeB. Both NodeA and NodeB have pods running on them that the AWS NLB is sending connections to.
A connection from the NLB in this setup can land on either node and then be forwarded by kube-proxy to a pod on either node (e.g., NLB -> NodeA -> kube-proxy -> pod on NodeB).
In this scenario, Karpenter decides the NodeA EC2 instance is less cost effective than another option, so it decides to consolidate and starts draining the pod on NodeA (and taints the node appropriately). When it does this, the aws-load-balancer-controller also picks up the new taints and starts the deregistration process with the NLB.
When the pod on NodeA finishes draining, that does not mean that all connections to that Node have completed and been drained, as the NLB is on a separate loop, since it can use any node in the node group to send to any pod. Therefore, NodeA still likely has connections open from the NLB that are going through NodeA to NodeB to another pod that is not draining.
The problem here is Karpenter assumes that the Pod removal on NodeA means it's good to terminate NodeA. Except that's not true because the NLB hasn't finished deregistering NodeA yet and still has open connections that go through NodeA to the pod(s) on NodeB.
So when Karpenter kills NodeA before the NLB deregistration completes, this results in client facing 5xx errors.
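For context, the Service setup being described looks roughly like this (a sketch; names, ports, and the deregistration delay value are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-gateway   # placeholder
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "instance"
    # how long the NLB keeps draining connections after a node is deregistered
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=300"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster   # any node can proxy traffic to the backing pods
  selector:
    app: ingress-gateway
  ports:
    - name: https
      port: 443
      targetPort: 8443
```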
Is there a signal that Karpenter can consume to know when it's waited long enough?
A perfect solution would be for Karpenter to query the NLB to determine that the node is no longer registered and then, once the deregistration is complete, terminate the node.
However, a perfectly acceptable solution -- and comparable to current solution with cluster-autoscaler -- is simply to have a timer for Karpenter that can be set to any value that simply says "wait 'x' amount of seconds after node meets termination conditions to actually send termination signal to AWS." Essentially just a dumb sleep.
This way, we could set that equal to the deregistration delay value of the NLB, and the behavior would be something like: drain the node, wait the configured delay while the NLB finishes deregistration, and only then send the termination signal.
As a side question, @njtran ... one way I'm looking to hack around this limitation right now is as follows. Would be interested to know if this would work based on your understanding: run an extra DaemonSet on these nodes with a terminationGracePeriodSeconds of 180s-200s (i.e., the deregistration window for the NLB). This would allow our actual Deployments to restart in a normal fashion and move around quickly, but the extra DaemonSet wouldn't terminate for a pre-determined time, which should prolong the Karpenter termination decision until after the NLB deregistration completes.
I figure this should work as I believe Karpenter does respect DaemonSets that were running, but ignores them if they restart. So, a prolonged DaemonSet could help me slow Karpenter down.
EDIT: I think this answers that question: https://github.com/kubernetes-sigs/karpenter/issues/621#issuecomment-1773345064
i.e., as long as I don't tolerate those taints on my DS, that DS will be respected like any Deployment, and should slow Karpenter down while not impacting my rollout restarts on the Deployments I actually care about.
EDIT 2: I actually think the above link is wrong, since the DaemonSet controller in K8s automatically adds those Tolerations to all DaemonSets: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations ... which means I think we're back to no real acceptable workaround on this.
The comment you've referred to is still correct, since Karpenter will add its own taint, not the unschedulable taint to nodes that are being deleted. I think that daemonset workaround might work. You'd need to model the terminationGracePeriodSeconds as the sum of the pod's terminationGracePeriodSeconds + deregistration delay.
However, a perfectly acceptable solution -- and comparable to current solution with cluster-autoscaler -- is simply to have a timer for Karpenter that can be set to any value that simply says "wait 'x' amount of seconds after node meets termination conditions to actually send termination signal to AWS." Essentially just a dumb sleep.
I don't think a sleep is the right way to solve this. If Karpenter picks a value, that means that much longer before instances are terminated. There is rarely a valid use-case for including a hacky hard-coded sleep, as this is essentially tech debt that we'd need to go back to evaluate any time Karpenter/EC2 has some performance improvements that make the sleep unnecessary. Although, there is a point, that this might be better encapsulated within https://github.com/kubernetes-sigs/karpenter/issues/740.
I understand and don't necessarily disagree completely on this. Just a point of clarification, though. I'm not asking Karpenter to have some hardcoded value. I'm asking that a parameter be exposed, with a default value of 0
(i.e., don't wait), that can be set by the end-user if they have a need to sleep those termination events. So, if the end-user doesn't set it, Karpenter behaves normally. If the end-user sets it, then Karpenter sleeps the user defined timeframe before sending the termination signal.
The only other way I really see to solve this is to have Karpenter inspect NLB membership information for nodes that are targeted for termination, and watch the NLB for when the node no longer exists as registered/draining (i.e., the node is not found in the target group)... and then send the termination on that.
Unfortunately, the DaemonSet solution didn't work, as the DaemonSet controller automatically adds the tolerations to applied DaemonSets, so there's no real workaround to this.
@awiesner4 is it correct to say that the target group for the LB waits the deregistration delay from when the node is deregistered from the target group, which I think should happen when the node.kubernetes.io/exclude-from-external-load-balancers: true label is applied to the node? i.e. we want to wait the 300s (or whatever the deregistration delay is set to) from when the node is cordoned, rather than from when the last pod exits?
I'm trying to understand how much of an issue this will be for us, we are just about to roll out karpenter. I think that often we will be waiting for at least the default deregistration delay during pod draining anyway.
@martinsmatthews That is correct. It's always difficult to get the timing exactly right, as with cluster-autoscaler there are lots of things at play (Auto Scaling Group, aws-load-balancer-controller, NLB).
In general, I believe your understanding is correct and that the deregistration clock basically starts when the aws-load-balancer-controller removes the node from the TargetGroupBinding, which then sends the deregistration signal to the ELB... not when the last pod is removed.
we want to wait the 300s (or whatever the deregistration delay is set to) from when the node is cordoned, rather than from when the last pod exits?
This is interesting. We're basically talking about something very similar to what we're discussing here: https://github.com/aws/karpenter-provider-aws/issues/2917#issuecomment-1930832965. Basically, we need a minNodeDrainTime to make this work out well. For spot interruption, this has utility because we want to make sure that the node stays alive as long as it can before draining so that we make sure that we maximize the pod lifetime and reduce application downtime. I think for spot interruption we would want to set a combination of minNodeDrainTime and maxNodeDrainTime but here it really seems like we need minNodeDrainTime and that's it
Unfortunately, the DaemonSet solution didn't work, as the DaemonSet controller automatically adds the tolerations to applied DaemonSets, so there's no real workaround to this
To be clear, this solution should still work for the reasons that @njtran called out. Karpenter uses its own taint (karpenter.sh/disruption=disrupting:NoSchedule) to determine if a pod is tolerating the Karpenter cordoning logic and whether Karpenter should respect its terminationGracePeriod. Therefore, default DaemonSets (as long as you ensure that you aren't tolerating all taints) should work here since Karpenter's taint isn't included in those defaults.
More info here: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations
One other thought that I had (this is probably not the best option, but you could do it): You could create another controller that adds a finalizer to the node. It ensures that it doesn't remove the finalizer on the node until it is 300s past its deletionTimestamp. This should get the behavior you want until/if this feature is directly built-in to Karpenter.
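For illustration, that suggestion amounts to a custom controller keeping a finalizer like this on the Node (the finalizer key and node name are hypothetical) and only removing it once 300s have passed since the node's deletionTimestamp:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.ec2.internal           # placeholder node name
  finalizers:
    - example.com/nlb-deregistration-wait  # hypothetical finalizer managed by the custom controller
```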
@jonathan-innis agreed, minNodeDrainTime should address this if set to the deregistration delay.
Someone mentioned a workaround at https://github.com/aws/aws-node-termination-handler/issues/316#issuecomment-1970009212
Thanks, @jonathan-innis. I hadn't come back to this in a while. I went back and retested and, you are correct, it does look as though creating a simple DaemonSet does work. I was able to coordinate the timing and, while I haven't been able to run our k6s tests through it to see if we drop any connections, the timing I'm seeing between node removal by Karpenter and its deregistration in the NLB suggests that the DaemonSet approach should work in the interim until this functionality is added into Karpenter.
Is there an update on this? We recently migrated to Karpenter and after quite some debugging we noticed the NLB issues are arising because Karpenter terminates nodes before they can be de-registered in the NLB.
It's usually not an issue when draining a node takes a bit of time, as the AWS load-balancer-controller will pick up the taint of the nodes being drained and will start the deregistration process in the NLB.
For nodes that are drained quickly, however, the AWS load-balancer-controller will not recognize this in time and the node will be terminated by Karpenter before it can be deregistered from the NLB.
Adding a simple optional grace period before terminating nodes seems reasonable, as this is causing issues for everyone using Karpenter with NLBs in instance mode.
Modifying the Userdata and deploying a Daemonset as mentioned in the workaround seems very hacky.
Is there an update on this? We are also stuck with issues of node termination before draining from the load balancer.
Separately, as I've been looking more into this: Does IP mode completely solve this problem? Since the connections are routed directly to the pods, doesn't that completely remove other nodes from the picture of connections? Assuming you have no Instance mode-based connections that are being made on your cluster, draining a node that doesn't contain a pod that is handling a connection shouldn't matter -- you should just be able to handle your connection draining at the pod-level at that point. cc: @awiesner4 @infa-ddeore
We've been thinking a little bit more about how to handle this. I'm not in love with the time based solution since I'd rather just get a signal from the ALB/NLB that it's safe to drain the node. For instance, this has become relevant with PVC volume attachments (see https://github.com/aws/karpenter-provider-aws/pull/6336) where we are proposing waiting for VolumeAttachment objects to be completely removed from the node before we consider the node fully drained and ready to be removed. Ideally, we could have some Kubernetes API-based representation of this draining completion from the ALB/NLB that we could trust. One option here is a finalizer -- though, I'm not quite sure how this would fare in instance mode when basically every node would have to receive this finalizer.
If we were going to consider a drain delay, I'd think that there'd have to be more than a single use-case to create a generalized solution for this problem -- else, we'd probably be better off just building something more tailored to this specific problem.
Hey @jonathan-innis
Does IP mode completely solve this problem? Since the connections are routed directly to the pods, doesn't that completely remove other nodes from the picture of connections?
Yes and no. IP mode does resolve this specific issue, but the problem with IP mode is that NLB registration takes anywhere from 3-5 minutes, which is expected behavior. This means that, with PodReadinessGates configured (so that pods don't go ready until they're registered), we end up with extremely slow rollouts for our ingress deployments that have hundreds of pods, since each pod takes 3-5 minutes to become ready, purely based on when the NLB completes registration.
With instance mode, new nodes do take 3-5 minutes to come online, but the pods on them become ready immediately, and the nodes themselves are able to receive traffic from other nodes in the cluster to send to those pods (just not directly from the NLB until the registration is complete).
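For reference, the PodReadinessGates mentioned above are injected by the aws-load-balancer-controller for pods in namespaces labeled like this (the namespace name is just an example):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: istio-system
  labels:
    # aws-load-balancer-controller injects target-group readiness gates
    # into pods created in namespaces carrying this label
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```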
I generally agree with your mental model of how to approach this. The robust solution is to have Karpenter determine the NLB that owns the node, and confirm when deregistration completes. Once it does, then remove the node. This would scale better if times fluctuated, etc..
My only mention of the time based solution is two-fold: it's comparable to the current cluster-autoscaler behavior, and the node would simply sit in something like terminating:Wait for 5 minutes to allow the NLB to drain.

Just as an update for the rest of the users on this thread, we are moving forward with the DaemonSet approach. There are no user-data changes necessary.
We're simply creating a DaemonSet that runs on all of our ingress nodes:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: karpenter-termination-waiter
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: karpenter-termination-waiter
  template:
    metadata:
      labels:
        app: karpenter-termination-waiter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: service-group
                    operator: In
                    values:
                      - istio-igw
                      - istio-igw-pub-adserver
      containers:
        - command:
            - sleep
            - infinity
          image: alpine:latest
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '300'
          name: alpine
          resources:
            limits:
              cpu: 5m
              memory: 10Mi
            requests:
              cpu: 2m
              memory: 5Mi
      priorityClassName: system-node-critical
      terminationGracePeriodSeconds: 300
      tolerations:
        - effect: NoSchedule
          key: service-group
          operator: Equal
          value: istio-igw
        - effect: NoSchedule
          key: service-group
          operator: Equal
          value: istio-igw-pub-adserver
        - effect: NoSchedule
          key: karpenter.sh/nodepool
          operator: Equal
          value: plat-infra-cpu-istio-igw
        - effect: NoSchedule
          key: karpenter.sh/nodepool
          operator: Equal
          value: plat-infra-cpu-istio-igw-pub-adserver
I've run this through lots of k6s tests where I'm sending lots of traffic through our ingress gateways while adding and removing pods/nodes, and I'm not dropping any connections. I can see the DaemonSet pod get terminated and begin its 5m wait (and watch the NLB finish deregistration before the node is removed by Karpenter).
It's not the ideal solution, but it does appear to work.
@jonathan-innis just an update on the workaround suggested in this comment, the controller looked to be the cleanest/simplest option for us but it doesn't appear to work. The node remains in k8s until our controller removes its finalizer but karpenter still terminates the ec2 instance as soon as it has drained it. So we're going to pivot to the daemonset approach from above.
This is when using the controller/finalizer on the node:
{"level":"INFO","time":"2024-07-19T15:15:54.736Z","logger":"controller.node.termination","message":"tainted node","commit":"17dd42b","node":"ip-10-39-164-108.us-west-2.compute.internal"}
...
{"level":"INFO","time":"2024-07-19T15:16:46.602Z","logger":"controller.node.termination","message":"deleted node","commit":"17dd42b","node":"ip-10-39-164-108.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-07-19T15:21:11.807Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"17dd42b","nodeclaim":"rce-compute-1-lb25r","node":"ip-10-39-164-108.us-west-2.compute.internal","provider-id":"aws:///us-west-2a/i-0b9685caaf7f5c6eb"}
With the daemonset approach:
{"level":"INFO","time":"2024-07-19T16:08:31.481Z","logger":"controller.node.termination","message":"tainted node","commit":"17dd42b","node":"ip-10-39-163-175.us-west-2.compute.internal"}
...
{"level":"INFO","time":"2024-07-19T16:13:38.606Z","logger":"controller.node.termination","message":"deleted node","commit":"17dd42b","node":"ip-10-39-163-175.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-07-19T16:13:38.948Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"17dd42b","nodeclaim":"rce-compute-1-48dgr","node":"ip-10-39-163-175.us-west-2.compute.internal","provider-id":"aws:///us-west-2b/i-0efb4e2edfe15f912"}
And like @awiesner4 says, no userdata change required
@infa-ddeore This might not be the fastest solution, but have you considered provisioning an ingress pod, such as nginx or an Istio Ingress Gateway? The ingress pod can then be hosted on a managed node group, similar to Karpenter itself. This will ensure that the likelihood of this ingress pod being deregistered is very low. The ingress pod is connected to your ELB.
Routing rules can then be configured to forward traffic from the ingress pod to the workloads hosted on your Karpenter nodes.
Description
Observed Behavior: instance is terminated immediately after draining
Expected Behavior: add a configurable delay between drain and terminate to honor the ELB's connection draining interval. It is disruptive behaviour if the node is deleted while serving existing traffic and the ELB connection draining setting is enabled.
Reproduction Steps (Please include YAML): when Karpenter deletes the node, it gets removed from the ELB and then deleted immediately.
Versions:
Chart Version: karpenter-v0.28.1
Kubernetes Version (kubectl version): 1.26
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment