This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
Machine Learning Model Deployment to AKS Fails when using Cluster Autoscaler #29902
Describe the bug
Models cannot be deployed to AKS through the Python SDK's azureml.core.Model.deploy when the AKS cluster autoscaler is enabled and the deployment needs an additional node. The deployment times out after 5 minutes, before the autoscaler has had a chance to scale out to support the new workload.
To Reproduce
Steps to reproduce the behavior:
1. Stand up an AKS cluster with the autoscaler enabled. Set the minimum node count to 1 and the maximum to at least 2.
2. After the cluster starts, ensure that the number of running nodes is less than the maximum. If they are equal, update the autoscaler rules by raising the maximum so that it is at least 1 higher than the current number of running nodes.
3. Attach the new cluster as Kubernetes Compute in Azure ML.
4. Create an Azure ML model.
5. Deploy the new Azure ML model to the new cluster using azureml.core.Model.deploy. Ensure that the AksWebservice.deploy_configuration has cpu_cores and/or memory_gb values set high enough that AKS cannot schedule the model onto the existing single node (because of the system resources already running there), but low enough that they fit on whatever SKU size you selected in step 1. If staged properly, this should trigger the cluster autoscaler to begin adding a new node.
6. Watch as the AKS autoscaler kicks in and begins adding another node.
7. Watch as the Azure ML deployment fails at the 5 minute mark, with the message: Couldn't Schedule because the kubernetes cluster didn't have available resources after trying for 00:05:00
8. Watch as the AKS autoscaler finishes adding the new node, just a little too late.
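For anyone trying to reproduce this, the deployment step can be sketched roughly as below. The helper names (triggers_scale_out, deploy_to_aks), the default service name, and any capacity numbers you pass in are hypothetical and only for illustration; Model.deploy, AksWebservice.deploy_configuration, and wait_for_deployment are the azureml-core calls named in this report. The sizing check only restates the condition from the deployment step: the request must not fit in the free capacity of the existing node, but must fit on an empty node of the same SKU.

```python
def triggers_scale_out(request_cpu, request_gb, free_cpu, free_gb, node_cpu, node_gb):
    """Return True when the request should force the autoscaler to add a node.

    free_*  : capacity still unallocated on the existing node (assumed values)
    node_*  : allocatable capacity of an empty node of the chosen SKU (assumed values)
    """
    fits_existing = request_cpu <= free_cpu and request_gb <= free_gb
    fits_new_node = request_cpu <= node_cpu and request_gb <= node_gb
    return (not fits_existing) and fits_new_node


def deploy_to_aks(workspace, model, inference_config, aks_target,
                  cpu_cores, memory_gb, service_name="autoscale-repro"):
    """Deploy a registered model to an attached AKS cluster (hypothetical wrapper)."""
    # Imported lazily so the sizing helper above works without azureml-core installed.
    from azureml.core.model import Model
    from azureml.core.webservice import AksWebservice

    # Request enough CPU and memory that the pod cannot land on the existing node.
    deployment_config = AksWebservice.deploy_configuration(
        cpu_cores=cpu_cores,
        memory_gb=memory_gb,
    )
    service = Model.deploy(
        workspace=workspace,
        name=service_name,
        models=[model],
        inference_config=inference_config,
        deployment_config=deployment_config,
        deployment_target=aks_target,
    )
    # In the scenario described here, this is where the failure surfaces
    # after about 5 minutes.
    service.wait_for_deployment(show_output=True)
    return service
```

The capacity figures have to come from your own cluster (for example from the node's allocatable resources); nothing in this sketch reads them for you.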
Expected behavior
The deployment should wait for longer than 5 minutes before timing out, if the autoscaler is actively adding an additional node to support the requested workload.
Screenshots
N/A
Additional context
This doc tells me the failure is due to resource constraints, which is true. But the real problem is that the deployment doesn't wait long enough for the AKS node pool to scale out, which (for us) takes just a bit longer than 5 minutes.
This could be worked around by making the timeout an optional parameter to the deploy function via the deployment_config parameter, with a default value of 5 minutes.
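Until something like that exists, one client-side stopgap is to retry the deployment, since the new node is usually ready shortly after the first attempt fails. The following is a minimal sketch under that assumption; deploy_with_retry and its parameters are hypothetical and not part of the SDK, and it catches any exception because I have not pinned down which exception type the scheduling failure raises.

```python
import time


def deploy_with_retry(deploy_once, total_timeout_s=1200, retry_delay_s=30,
                      clock=time.monotonic, sleep=time.sleep):
    """Call deploy_once() until it succeeds or total_timeout_s has elapsed.

    deploy_once is any zero-argument callable that performs one full
    deployment attempt and raises on failure. clock and sleep are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + total_timeout_s
    attempt = 0
    while True:
        attempt += 1
        try:
            return deploy_once()
        except Exception as exc:  # assumption: any failure is worth one more try
            # Give up if waiting for the next attempt would pass the deadline.
            if clock() + retry_delay_s >= deadline:
                raise TimeoutError(
                    f"deployment still failing after {attempt} attempt(s)"
                ) from exc
            sleep(retry_delay_s)
```

Each attempt would wrap the Model.deploy and wait_for_deployment calls. A failed attempt may leave a service with the same name behind, so the attempt probably needs to pass overwrite=True to Model.deploy or delete the failed service first; I have not verified which is required.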