Open · oweng opened 1 week ago
Take a look at alternative solutions like cast.ai, where you can gain significant stability coupled with significant savings.
Sounds like you haven't defined `pod.spec.containers.resources.requests.cpu`, or, if you have, you've seriously low-balled it. Set `...requests.cpu` to `1` and see if that doesn't sort it out. I'm not familiar with Dagster, but I'd also check its docs to see what concurrency it defaults to when not explicitly configured. If it has such a knob, set it to a single worker (or set it however you like, but also use that value for `...requests.cpu`).
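For illustration, here's a minimal sketch of a Job whose pods carry an explicit CPU request; the name, image, and memory values are placeholders, not anything Dagster generates:

```yaml
# Sketch only: a Job whose pod requests a full CPU. With requests.cpu set,
# kube-scheduler can only bin-pack as many of these pods onto the existing
# node as its allocatable CPU allows; the rest go Pending, which is the
# signal Karpenter reacts to by provisioning additional nodes.
apiVersion: batch/v1
kind: Job
metadata:
  name: dagster-run-example          # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: run
          image: my-dagster-user-code:latest   # placeholder image
          resources:
            requests:
              cpu: "1"        # reserve a full core per run, as suggested above
              memory: "1Gi"   # assumption: size to your workload
            limits:
              memory: "1Gi"
```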
That definitely seems realistic. Karpenter is not responsible for scheduling pods; kube-scheduler is. So if the pods scheduled successfully, Karpenter fulfilled its purpose of ensuring enough capacity was available on the cluster to satisfy the pods' requests. If the pods carry no (or very small) CPU requests, all 25 fit on the existing node, nothing ever goes Pending, and Karpenter sees no unschedulable pods to act on.
Description
I've been looking through the docs, and maybe I am missing something, but we currently have all our node pools being scaled via Karpenter with no issues at all for deployments.
Recently we have started some Dagster deployments, and when the data runs start up, they launch 25 Batch Jobs. When this happens, they are all pinned to the single node in the node pool, and we don't see Karpenter scaling up the nodes. Pod-wise, they all seem to start up and immediately enter a Running state, and the instance more or less becomes unresponsive until they eventually finish their work. Just wondering if there is something I am missing?