amitpate opened this issue 4 months ago
Same here.
Please check if the pod is stuck in the "Terminating" status.
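A minimal sketch of that check, assuming kubectl is configured for the cluster. kubectl shows a pod as "Terminating" once its metadata.deletionTimestamp is set, so the helper lists pods whose JSON carries that field. The function names are hypothetical, and the namespace you pass in is whatever namespace your ComfyUI deployment actually uses.

```python
import json
import subprocess
from typing import List


def terminating_pods(pod_list: dict) -> List[str]:
    """Return names of pods that have a deletion timestamp set.

    kubectl prints STATUS "Terminating" for exactly these pods.
    """
    names = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        if meta.get("deletionTimestamp"):
            names.append(meta.get("name", "<unknown>"))
    return names


def fetch_pods(namespace: str) -> dict:
    """Run `kubectl get pods -n <namespace> -o json` and parse the output."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Usage: terminating_pods(fetch_pods("default")), replacing "default" with your own namespace. An empty list means nothing is stuck in Terminating.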
The pod's status stays at "Running".
You can check the Karpenter logs to see whether it is working correctly, or open an issue in the Karpenter repository.
I'm also seeing this behavior. I'm still new to Karpenter, but with the ComfyUI deployment, how does it know to scale down in the first place? Is that already configured in the disruption policy?
Could you provide some Kubernetes and Karpenter logs? I'm not able to reproduce the issue.
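The GPU nodes indeed don't scale down, and they don't scale up automatically either. I sent 500 concurrent text-to-image requests and no new machines were added.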
I think I misunderstood what you meant. I thought that by "no workload" you meant no pods were running, not that ComfyUI was receiving no requests. Karpenter is responsible for ensuring that your cluster has enough nodes to schedule your pods without wasting resources. If a pod has no node it can be scheduled on, Karpenter launches the cheapest instance that meets the requirements in your definition YAML and schedules the pod there; if a node has no pods running on it (no workload), the node is terminated to save costs. In other words, Karpenter reacts to pods, not to request volume.
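The two rules in that explanation can be sketched as a toy model: pick the cheapest instance type whose capacity covers a pending pod's requests, and treat nodes running no pods as termination candidates. This is only an illustration under simplifying assumptions (GPU count and memory as the only resources, made-up instance names and prices); Karpenter's real controller is considerably more involved and also weighs consolidation settings and disruption budgets.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str          # hypothetical instance type name
    gpus: int
    memory_gib: int
    hourly_price: float


@dataclass(frozen=True)
class PodRequest:
    gpus: int
    memory_gib: int


def cheapest_fit(pod: PodRequest, offerings: List[InstanceType]) -> Optional[InstanceType]:
    """Toy scale-up rule: cheapest instance type that satisfies the pod's requests."""
    fits = [t for t in offerings
            if t.gpus >= pod.gpus and t.memory_gib >= pod.memory_gib]
    return min(fits, key=lambda t: t.hourly_price, default=None)


def empty_nodes(pods_per_node: Dict[str, int]) -> List[str]:
    """Toy scale-down rule: nodes running no pods are candidates for termination."""
    return [node for node, count in pods_per_node.items() if count == 0]
```

Note what the model does not contain: request counts. A ComfyUI pod that is Running but idle still occupies its node, so the node is never "empty" and is never removed.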
If you want to autoscale GPU nodes, we recommend monitoring the ComfyUI pending queue through the /prompt ComfyUI API, developing your own scaling policy (for example, scale out when the number of pending requests exceeds a threshold), and scaling in or out with a single command: kubectl scale deploy/comfyui --replicas=0/1/2/3/4/...
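A minimal sketch of such a policy, with several labeled assumptions: GET /prompt is assumed to return JSON shaped like {"exec_info": {"queue_remaining": N}}; the per_replica, min_replicas and max_replicas values are hypothetical tuning knobs; and a single ComfyUI endpoint is assumed (with several replicas each pod keeps its own queue, so you would sum the counts across pods). min_replicas defaults to 1 because at zero replicas there is no ComfyUI pod left to query, so scaling from zero needs the pending count to come from somewhere in front of ComfyUI.

```python
import json
import math
import subprocess
import urllib.request


def queue_remaining(base_url: str) -> int:
    """Read the pending-queue length from ComfyUI's GET /prompt endpoint.

    Assumption: the response looks like {"exec_info": {"queue_remaining": N}}.
    """
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/prompt", timeout=5) as resp:
        body = json.load(resp)
    return int(body["exec_info"]["queue_remaining"])


def desired_replicas(pending: int, per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Hypothetical policy: one replica per `per_replica` pending prompts, clamped."""
    wanted = math.ceil(max(pending, 0) / per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def scale(replicas: int, deployment: str = "comfyui") -> None:
    """Apply the decision with `kubectl scale deploy/<name> --replicas=N`."""
    subprocess.run(
        ["kubectl", "scale", f"deploy/{deployment}", f"--replicas={replicas}"],
        check=True,
    )
```

A control loop would call scale(desired_replicas(queue_remaining(url), per_replica=10)) every so often, where url is your ComfyUI service address (8188 is ComfyUI's default port; the host name depends on your setup). Once replicas increase, pending pods trigger Karpenter to add GPU nodes, and once they decrease, the emptied nodes get removed. Given the slow startup described below, scaling down too eagerly can cost more than it saves.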
One thing to note: it may take some time (from a few minutes up to ten minutes) between launching an instance and the pod reaching Running, because we trade startup time for model loading and switching performance.
Karpenter doesn't appear to be scaling down to zero nodes when there's no workload.