Open gaelayo opened 1 week ago
Do you have PDBs/do-not-disrupt pods set up? Those would slow down the rate at which Karpenter can drain your node. In v1, we also wait for the instance to be fully terminated before removing the node/nodeclaim, so you might be seeing that as well. That ensures that all applications are cleaned up before we go ahead and deregister the node from the cluster.
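To rule those out, a quick sketch of how you might check both (assumes the `karpenter.sh/do-not-disrupt` pod annotation; adjust namespaces as needed):

```shell
# List all PodDisruptionBudgets in the cluster
kubectl get pdb --all-namespaces

# Look for pods annotated with karpenter.sh/do-not-disrupt
kubectl get pods --all-namespaces -o yaml | grep -B5 'karpenter.sh/do-not-disrupt'
```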
You may be interested in setting `terminationGracePeriod` (`NodePool.Spec.Template.Spec.TerminationGracePeriod`) on your NodePool to set a timeout on how long Karpenter can spend draining a node before it is forcibly cleaned up: https://karpenter.sh/docs/concepts/disruption/#terminationgraceperiod
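For reference, a minimal sketch of where that field lives (the NodePool name and the 5m value are placeholders, not recommendations):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default            # placeholder name
spec:
  template:
    spec:
      # Upper bound on how long Karpenter drains before forcibly terminating
      terminationGracePeriod: 5m
      # ... requirements, nodeClassRef, etc. elided
```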
I do not have PDBs on the 5 pods that remain on the node while it is "stuck" (`aws-node`, `ebs-csi-node`, `kube-proxy`, `monitoring-prometheus-node-exporter`, `node-problem-detector`). They are all from DaemonSets. However, some of these pods are in the priority class `system-node-critical`, if this matters.
Thank you for the link to `TerminationGracePeriod`; this may be what I end up using, even if I would prefer to understand why some pods seem to be blocking draining. I am not even sure that "blocking draining" is the right phrase, because when I look at the events, I see:
```
Warning  FailedDraining       64s  karpenter  Failed to drain node, 9 pods are waiting to be evicted
Warning  InstanceTerminating  51s  karpenter  Instance is terminating
```
Which, as I understand it, means that all pods were evicted after 13s, and that the instance was terminating.

Writing this, I wonder if this is just an issue with AWS taking too long to terminate the instance? I'll try to monitor the status of the AWS instance alongside Karpenter to see if the instance is really placed in a `Terminating` state on the AWS side of things.
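One way to watch that from the AWS side (the instance ID below is a placeholder; a terminating instance should move through `shutting-down` to `terminated`):

```shell
# Poll the EC2 instance state while Karpenter drains and removes the node
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].State.Name' \
  --output text
```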
Description
Observed Behavior: The instance takes more than 5 minutes to terminate. Not sure if this is a bug or to be expected, but this sounds quite long (especially since we run `SpotToSpotConsolidation`, which leads to a lot of volatility in our pods).

Expected Behavior: The node should be deleted quickly.

Reproduction Steps (Please include YAML): Karpenter deployment yaml:

When deleting a node, it quickly shows `Instance is terminating`, but then takes more than 5 minutes to actually delete the node. AFAIK the 9 pods that are waiting to be evicted are from DaemonSets (such as prometheus, nvidia gpu plugin, nvidia NFD, GPF, and also aws-node, ebs-csi, ...)
I see that, quite rapidly, the only 5 pods remaining on the node are `aws-node`, `ebs-csi-node`, `kube-proxy`, `monitoring-prometheus-node-exporter`, and `node-problem-detector`. I enabled `debug` logging on Karpenter, but I cannot see anything related to the node except the following lines:

Versions:
- Chart Version: 1.0.1
- Kubernetes Version (`kubectl version`): v1.29.7-eks-2f46c53
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment