Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Feature] smarter scaling question/suggestion? #3619

Closed lovettchris closed 6 days ago

lovettchris commented 1 year ago

Is your feature request related to a problem? Please describe.

I have a job that slams the pod SKU I have chosen, so I know up front that I never want more than one of my pods per node when auto-scaling my cluster. Is there a way to configure the HorizontalPodAutoscaler to do this? Setting targetCPUUtilizationPercentage to a low value like 20% doesn't seem to work: AKS still tries to put 2 pods on the same node, which is too much. One of those pods then crashes with an out-of-memory error, and that takes a lot of valuable processing time away from the first pod.

Describe the solution you'd like

I want to tell AKS to never put more than one of my pods per node. How can I do that? Is there another type of auto-scaler I should be using? Can I do this with a custom metric? Are any samples available?

Describe alternatives you've considered

I've considered manually running az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 20 to force the creation of 20 nodes, but how can I be sure AKS will use all 20 nodes before doubling up pods on a single node? Plus, I'd prefer to have auto-scaling.

Additional context

The job is a data-driven neural network quantization, a heavy-duty process that takes about 10 minutes to complete per model. I want my cluster to scale horizontally so I can process about 20 models in parallel on 20 nodes and finish all of them in about 10 minutes. Today the system takes much longer because of all this thrashing.

circy9 commented 1 year ago

Solution 1: Pod anti-affinity

"In the following example Deployment for the Redis cache, the replicas get the label app=store. The podAntiAffinity rule tells the scheduler to avoid placing multiple replicas with the app=store label on a single node. This creates each cache in a separate node."

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#more-practical-use-cases
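As a rough illustration of Solution 1 applied to the one-pod-per-node case, here is a minimal sketch that builds a Deployment manifest with a required pod anti-affinity rule keyed on the node hostname. The name "quantizer", the image path, and the replica count are hypothetical placeholders, not values from this issue. The manifest is emitted as JSON (which kubectl accepts) so the sketch needs only the Python standard library.

```python
import json


def one_pod_per_node_deployment(name, image, replicas):
    """Build a Deployment dict whose pods refuse to share a node with each other."""
    labels = {"app": name}  # hypothetical label; any label unique to this workload works
    anti_affinity_term = {
        # Match other pods of this same Deployment...
        "labelSelector": {"matchLabels": labels},
        # ...and treat each node (by hostname) as a separate placement domain.
        "topologyKey": "kubernetes.io/hostname",
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # "required" makes this a hard rule, not a preference.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                anti_affinity_term
                            ]
                        }
                    },
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    manifest = one_pod_per_node_deployment(
        "quantizer", "example.azurecr.io/quantizer:latest", 20
    )
    print(json.dumps(manifest, indent=2))
```

The output can be piped to kubectl apply -f -. Because the rule is required rather than preferred, replicas that cannot get a node of their own stay Pending instead of doubling up. If the cluster autoscaler is enabled on the node pool, those Pending pods are what normally prompts it to add nodes.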

Solution 2: Topology spread constraints

https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#example-multiple-topologyspreadconstraints
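For comparison, Solution 2 can be sketched as a pod spec carrying a single topology spread constraint across nodes. As above, the name, label, and image are hypothetical placeholders, and the result is plain JSON built with the standard library.

```python
import json


def spread_across_nodes_pod_spec(name, image, max_skew=1):
    """Build a pod spec that spreads matching pods evenly across nodes."""
    labels = {"app": name}  # hypothetical label; must match the pod template's labels
    constraint = {
        # Largest allowed difference in matching-pod count between any two nodes.
        "maxSkew": max_skew,
        "topologyKey": "kubernetes.io/hostname",
        # Leave the pod Pending rather than violate the skew limit.
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": labels},
    }
    return {
        "topologySpreadConstraints": [constraint],
        "containers": [{"name": name, "image": image}],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    spec = spread_across_nodes_pod_spec(
        "quantizer", "example.azurecr.io/quantizer:latest"
    )
    print(json.dumps(spec, indent=2))
```

One design point to keep in mind: a spread constraint balances pods across nodes but does not cap the count per node. With maxSkew set to 1, once every node holds one matching pod, a second pod may still land on any of them. For a strict limit of one pod per node, the anti-affinity rule from Solution 1 is the closer fit.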

lovettchris commented 1 year ago

Cool thanks, I will give these a try and see what happens.

lovettchris commented 1 year ago

Pod anti-affinity is working perfectly. This will save me a lot of compute time otherwise lost to thrashing machines. Thanks so much!

microsoft-github-policy-service[bot] commented 9 months ago

Action required from @Azure/aks-pm

microsoft-github-policy-service[bot] commented 8 months ago

Issue needing attention of @Azure/aks-leads
