[Feature] smarter scaling question/suggestion?

lovettchris commented 1 year ago

Is your feature request related to a problem? Please describe.

I have a job that slams the pod SKU I have chosen, so I know up front that I never want more than one of my pods per node when auto-scaling my cluster. Is there a way to configure the HorizontalPodAutoscaler to do this? Setting targetCPUUtilizationPercentage to a low value like 20% doesn't seem to work: it still tries to put 2 pods on the same node, which is too much. One of those pods then crashes with an out-of-memory error, and that takes a lot of valuable processing time away from the first pod.

Describe the solution you'd like

I want to tell AKS to never put more than one of my pods per node. How can I do that? Is there another type of auto-scaler I should be using? Can I do this with a custom metric? Are any samples available?

Describe alternatives you've considered

I've considered manually running az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 20 to force the creation of 20 nodes, but how can I be sure AKS will use all 20 nodes before doubling up pods on a single node? Plus, I'd prefer to have auto-scaling.

Additional context

The job is a data-driven neural network quantization, a heavy-duty process that takes about 10 minutes to complete per model. I want my cluster to scale horizontally so I can process about 20 models in parallel on 20 nodes and finish all of them in about 10 minutes. Today the system takes much longer because of all this thrashing.

circy9 commented 1 year ago

Solution 1: Pod anti-affinity

"In the following example Deployment for the Redis cache, the replicas get the label app=store. The podAntiAffinity rule tells the scheduler to avoid placing multiple replicas with the app=store label on a single node. This creates each cache in a separate node."

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#more-practical-use-cases
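
For the one-pod-per-node case in this issue, a minimal sketch of that rule might look like the manifest below. The Deployment name, the app=quantizer label, and the image are hypothetical placeholders, not taken from the original job. The required anti-affinity term with topologyKey kubernetes.io/hostname asks the scheduler not to place a pod on any node that already runs a pod with the same label.

```yaml
# Hypothetical Deployment: the name, label, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quantizer
spec:
  replicas: 20
  selector:
    matchLabels:
      app: quantizer
  template:
    metadata:
      labels:
        app: quantizer
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: a pod is only scheduled onto a node that has
          # no other pod carrying the app=quantizer label.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: quantizer
              # Each node is its own topology domain.
              topologyKey: kubernetes.io/hostname
      containers:
        - name: quantizer
          image: example.azurecr.io/quantizer:latest
```

Because the rule is required rather than preferred, replicas that cannot find an empty node should stay Pending instead of doubling up. If the cluster autoscaler is enabled on the node pool, those Pending pods are what would normally prompt it to add nodes.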

Solution 2: Topology spread constraints

https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#example-multiple-topologyspreadconstraints
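
A rough sketch of the same idea with a topology spread constraint, again using a hypothetical Deployment name, app=quantizer label, and image. Note the difference in guarantee: maxSkew: 1 keeps the per-node pod counts within one of each other, but it does not by itself cap a node at one pod. If every node already runs one replica, a second replica on a node still satisfies the constraint.

```yaml
# Hypothetical Deployment: the name, label, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quantizer
spec:
  replicas: 20
  selector:
    matchLabels:
      app: quantizer
  template:
    metadata:
      labels:
        app: quantizer
    spec:
      topologySpreadConstraints:
        # Allow at most a difference of 1 in matching pod count
        # between any two nodes.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          # Leave the pod Pending rather than violate the constraint.
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: quantizer
      containers:
        - name: quantizer
          image: example.azurecr.io/quantizer:latest
```

For a strict limit of one pod per node, the anti-affinity approach in Solution 1 is likely the better fit. Spread constraints are more useful when some doubling up is acceptable as long as it is even.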

lovettchris commented 1 year ago

Cool thanks, I will give these a try and see what happens.

lovettchris commented 1 year ago

Pod anti-affinity is working perfectly. This will save me a lot of compute time that was being lost to thrashing machines. Thanks so much!