Open Nuru opened 5 months ago
@engedaam wrote:
Karpenter considers a node empty when only DaemonSet and static pods remain on a node. What kind of pods are landing on the nodes you expect to be disrupted? (Are these short-lived pods?) Also, why does `WhenUnderutilized` not work for your use case?

`WhenUnderutilized` does not work because Karpenter does not support `consolidateAfter` when the `consolidationPolicy` is `WhenUnderutilized`. Our use cases are similar to the ones listed in that issue.
All kinds of pods end up on nodes I would otherwise like to be disrupted, but the problematic ones are the long-lived Deployment Pods that get redistributed from the existing Nodes when we start a job run and dramatically scale up, because they prevent us from scaling back down to the previous, compact Node Pool. If we could annotate the pods with "OK to evict", that would lessen the problem. Note that the Kubernetes Cluster Autoscaler has such an annotation: `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"`.
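For reference, the Cluster Autoscaler annotation is applied on the Pod itself. A minimal sketch (the pod name and image are placeholders, not from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: relocatable-worker              # placeholder name
  annotations:
    # Tells Cluster Autoscaler this pod may be evicted during scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
```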
We have a limit on our node pool size and a queue of backlogged jobs, so our typical experience is:
Without `consolidateAfter`, the node hosting the first pod gets marked for deletion before the next pod is deployed, causing excessive churn and slowing everything down by 90s per job. (Yes, this is an exaggeration, because we can and do batch jobs, but we see it happen IRL even with multiple staggered pods on a node, even though they are all annotated against disruption.)
@Nuru It would seem to me from your description that your jobs can get disrupted; that's why I'm trying to understand why you need `consolidateAfter` for when `consolidationPolicy` is `WhenUnderutilized`. Can you give a little more detail as to why you need to set `consolidateAfter` for when `consolidationPolicy` is `WhenUnderutilized`? You should be able to achieve your intended goal by using `WhenUnderutilized` and then setting `karpenter.sh/do-not-disrupt: "true"` on nodes you don't want to be disrupted.
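For concreteness, that annotation goes on the Node object. A sketch (the node name is a placeholder):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal     # placeholder node name
  annotations:
    # Karpenter will not voluntarily disrupt (consolidate/expire) this node
    karpenter.sh/do-not-disrupt: "true"
```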
@engedaam I'm confused by your confusion. :-) Maybe I was not clear that when I say "job" I mean a task that runs for a short time, unlike a service that runs "forever".
No, jobs cannot be disrupted, they need to run from start to finish in one Pod on one Node. Once the job is finished, the Pod is destroyed and the next job can run in a new Pod on a new Node. Other, long-lived, Pods in the cluster can be disrupted, which is how they end up on the huge Nodes spun up to handle a wave of jobs.
I do not have and do not want to be forced to write a controller that monitors where jobs are running and manages "do-not-disrupt" annotations on relevant Nodes.
Did you review the use cases at https://github.com/kubernetes-sigs/karpenter/issues/735 ?
Would termination grace period work for your use case? https://github.com/kubernetes-sigs/karpenter/pull/916
@engedaam No, termination grace period would not help me.
What would help is support for `consolidateAfter` when `consolidationPolicy` is `WhenUnderutilized`. Of course, this is premised on `WhenUnderutilized` operating properly, meaning it handles cases such as https://github.com/kubernetes-sigs/karpenter/issues/651#issuecomment-2197750561 and https://github.com/kubernetes-sigs/karpenter/issues/1167. Alternatively, make `WhenEmpty` work a bit more like `WhenUnderutilized` by being able to mark a Pod as "not present for the purposes of considering the node Empty", the way DaemonSet Pods are ignored.
Description
The immediate ask is to document what makes a Node considered "empty" for the purposes of `spec.disruption.consolidateAfter`.

Furthermore, is there a way to annotate a pod such that its presence on a Node should not stop Karpenter from considering the node "empty" for the purposes of consolidation? Maybe `karpenter.sh/do-not-disrupt: "false"`? If there is a way, it should be documented. If there is not a way, then can this be considered a feature request, or should I open that separately?

The bigger ask is to fully document the NodePool `spec.disruption` configuration. That seems to have slipped through the cracks. The only thorough documentation for it is in the CRD itself.
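For context, the `spec.disruption` block being discussed looks roughly like this in a NodePool (values are illustrative; per this thread, `consolidateAfter` is only honored together with `WhenEmpty`):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # WhenEmpty: only consolidate nodes running nothing but DaemonSet/static pods
    # WhenUnderutilized: consolidate whenever pods could be repacked more cheaply
    consolidationPolicy: WhenEmpty
    # Wait this long after a node becomes empty before consolidating it;
    # not supported with WhenUnderutilized at the time of this issue
    consolidateAfter: 300s
```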