aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

EKS jobs not causing Karpenter to scale nodes #7355

Open oweng opened 1 week ago

oweng commented 1 week ago

Description

I've been looking through the docs, and maybe I'm missing something, but we currently have all of our node pools scaled via Karpenter with no issues at all for Deployments.
Recently we started some Dagster deployments, and when the data runs kick off they launch 25 Batch Jobs. When this happens, all of the pods are pinned to the single node in the node pool, and we don't see Karpenter scale out additional nodes. Pod-wise, they all seem to start up and immediately enter a Running state, and the instance becomes more or less unresponsive until they eventually finish their work. Is there something I'm missing?

YuriFrayman commented 1 week ago

Take a look at alternative solutions like cast.ai, where you can gain significant stability coupled with significant savings.

gladiatr72 commented 1 week ago

Sounds like you haven't defined pod.spec.containers.resources.requests.cpu, or, if you have, you've seriously low-balled it. Set ...requests.cpu to 1 and see if that sorts it out. I'm not familiar with Dagster, but I'd also check the docs to see how it configures its concurrency when not given explicit instructions. If it has such a knob, set it to a single worker (or set it however you like, but use that same value for ...requests.cpu).
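
For illustration, a minimal sketch of a Job manifest with an explicit CPU request (the names and image here are placeholders, not anything Dagster-specific):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dagster-run-example        # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker                       # placeholder container name
          image: example.com/worker:latest   # placeholder image
          resources:
            requests:
              cpu: "1"        # each pod now claims a full core from the scheduler
              memory: 512Mi   # placeholder; size this to your workload
```

With requests set, kube-scheduler can only bin-pack as many of these pods onto the existing node as its allocatable CPU allows; the remainder go Pending, and pending pods are what Karpenter provisions new nodes for.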

jmdeal commented 1 week ago

That definitely seems plausible. Karpenter is not responsible for scheduling pods; kube-scheduler is. So if the pods scheduled successfully, Karpenter fulfilled its purpose of ensuring enough capacity was available on the cluster to satisfy the pods' requests. If the pods don't specify requests, they all fit on the existing node as far as the scheduler is concerned, so nothing goes Pending and there is nothing for Karpenter to provision for.
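
A quick way to confirm whether the Job pods actually carry CPU requests (the label selector is a placeholder for whatever selects your Dagster run pods):

```shell
kubectl get pods -l <your-run-pod-label> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\n"}{end}'
```

An empty second column means no CPU request is set.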