Open Nuru opened 5 months ago
@engedaam wrote:
Karpenter considers a node empty when only DaemonSet and static pods remain on a node. What kind of pods are landing on the nodes you expect to be disrupted? (Are these short-lived pods?) Also, why does `WhenUnderutilized` not work for your use case?

`WhenUnderutilized` does not work because Karpenter does not support `consolidateAfter` when the `consolidationPolicy` is `WhenUnderutilized`. Our use cases are similar to the ones listed in that issue.
All kinds of pods end up on nodes I would otherwise like to be disrupted, but the problematic ones are the long-lived Deployment Pods that get redistributed from the existing Nodes when we start a job run and dramatically scale up, because they prevent us from scaling back down to the previous, compact Node Pool. If we could annotate the pods with "OK to evict", that would lessen the problem. Note that the Kubernetes Cluster Autoscaler has such an annotation: `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"`.
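For reference, the Cluster Autoscaler annotation is applied on the Pod itself. A minimal sketch (the pod name and image are placeholders, not from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: relocatable-worker              # placeholder name
  annotations:
    # Tells Cluster Autoscaler this pod may be evicted during scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
```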
We have a limit on our node pool size and a queue of backlogged jobs, so our typical experience is:
Without `consolidateAfter`, the node hosting the first pod gets marked for deletion before the next pod is deployed, causing excessive churn and slowing everything down by 90s per job. (Yes, this is an exaggeration, because we can and do batch jobs, but we see it happen IRL even with multiple staggered pods on a node, even though they are all annotated against disruption.)
@Nuru It would seem to me from your description that your jobs can get disrupted; that's why I'm trying to understand why you need `consolidateAfter` for when `consolidationPolicy` is `WhenUnderutilized`. Can you give a little more detail as to why you need to set `consolidateAfter` for when `consolidationPolicy` is `WhenUnderutilized`? You should be able to achieve your intended goal by using `WhenUnderutilized` and then setting `karpenter.sh/do-not-disrupt: "true"` on nodes you don't want to be disrupted.
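For concreteness, that annotation goes on the Node object. A sketch (the node name is a placeholder):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal     # placeholder node name
  annotations:
    # Karpenter will not voluntarily disrupt (consolidate/expire) this node
    karpenter.sh/do-not-disrupt: "true"
```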
@engedaam I'm confused by your confusion. :-) Maybe I was not clear that when I say "job" I mean a task that runs for a short time, unlike a service that runs "forever".
No, jobs cannot be disrupted, they need to run from start to finish in one Pod on one Node. Once the job is finished, the Pod is destroyed and the next job can run in a new Pod on a new Node. Other, long-lived, Pods in the cluster can be disrupted, which is how they end up on the huge Nodes spun up to handle a wave of jobs.
I do not have and do not want to be forced to write a controller that monitors where jobs are running and manages "do-not-disrupt" annotations on relevant Nodes.
Did you review the use cases at https://github.com/kubernetes-sigs/karpenter/issues/735 ?
Would termination grace period work for your use case? https://github.com/kubernetes-sigs/karpenter/pull/916
@engedaam No, termination grace period would not help me.
What would help is support for `consolidateAfter` when `consolidationPolicy` is `WhenUnderutilized`. Of course, this is premised on `WhenUnderutilized` operating properly, meaning it handles cases such as https://github.com/kubernetes-sigs/karpenter/issues/651#issuecomment-2197750561 and https://github.com/kubernetes-sigs/karpenter/issues/1167. Alternatively, make `WhenEmpty` work a bit more like `WhenUnderutilized` by being able to mark a Pod as "not present for the purposes of considering the node Empty", the way DaemonSet Pods are ignored.
Description
The immediate ask is to document what makes a Node considered "empty" for the purposes of `spec.disruption.consolidateAfter`.

Furthermore, is there a way to annotate a pod such that its presence on a Node should not stop Karpenter from considering the node "empty" for the purposes of consolidation? Maybe `karpenter.sh/do-not-disrupt: "false"`? If there is a way, it should be documented. If there is not a way, then can this be considered a feature request, or should I open that separately?

The bigger ask is to fully document the NodePool `spec.disruption` configuration. That seems to have slipped through the cracks. The only thorough documentation for it is in the CRD itself.
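For context, the `spec.disruption` block being discussed looks roughly like this in a NodePool (values are illustrative; per this thread, `consolidateAfter` is only honored together with `WhenEmpty`):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # WhenEmpty: only consolidate nodes running nothing but DaemonSet/static pods
    # WhenUnderutilized: consolidate whenever pods could be repacked more cheaply
    consolidationPolicy: WhenEmpty
    # Wait this long after a node becomes empty before consolidating it;
    # not supported with WhenUnderutilized at the time of this issue
    consolidateAfter: 300s
```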