Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Machine Learning Model Deployment to AKS Fails when using Cluster Autoscaler #29902

Open phehr2 opened 1 year ago

phehr2 commented 1 year ago

Describe the bug
Unable to deploy models to AKS via the Python SDK method azureml.core.Model.deploy when the AKS cluster autoscaler is enabled. The deployment times out after 5 minutes, before the autoscaler has had a chance to scale out to support the new workload.

To Reproduce
Steps to reproduce the behavior:

  1. Stand up an AKS cluster with the autoscaler enabled. Set the minimum node count to 1 and the maximum to at least 2
  2. After the cluster starts, ensure that the number of running nodes is less than the max. If they are equal, update the autoscaler rules by incrementing the max nodes so that it's at least 1 higher than the current number of running nodes
  3. Attach the new cluster as Kubernetes Compute in Azure ML
  4. Create an Azure ML model
  5. Deploy the new Azure ML model to the new cluster using azureml.core.Model.deploy. Ensure that the AksWebservice.deploy_configuration has cpu_cores and/or memory_gb values set high enough such that AKS would not be able to schedule the model onto the existing single node (due to the system resources running there already), but low enough that they can fit on whatever SKU size you selected in step 1. If staged properly, it should trigger the cluster autoscaler to begin adding a new node
  6. Watch as the AKS autoscaler kicks in and begins adding another node
  7. Watch as the Azure ML deployment fails at the 5-minute mark, with the message: Couldn't Schedule because the kubernetes cluster didn't have available resources after trying for 00:05:00
  8. Watch as the AKS autoscaler finishes adding the new node, just a little too late
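
The sizing condition in step 5 can be sketched in plain Python. This is only an illustration: the Resources class, the forces_scale_out helper, and all the numbers are hypothetical and are not part of azureml or AKS. The real free and allocatable values depend on the node SKU and on what is already running on the node. The chosen request values are what would then be passed as cpu_cores and memory_gb to AksWebservice.deploy_configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """CPU and memory, either requested by a deployment or available on a node."""
    cpu_cores: float
    memory_gb: float

    def fits_within(self, other: "Resources") -> bool:
        # Both dimensions must fit for the scheduler to place the workload.
        return (self.cpu_cores <= other.cpu_cores
                and self.memory_gb <= other.memory_gb)


def forces_scale_out(request: Resources,
                     free_on_existing_node: Resources,
                     allocatable_on_new_node: Resources) -> bool:
    """True when the request cannot fit on the existing node but would fit
    on a freshly added node, which is the situation step 5 sets up."""
    return (not request.fits_within(free_on_existing_node)
            and request.fits_within(allocatable_on_new_node))


# Hypothetical numbers, for illustration only.
free_now = Resources(cpu_cores=1.5, memory_gb=6.0)     # left over after system pods
fresh_node = Resources(cpu_cores=3.8, memory_gb=12.5)  # allocatable on an empty node

request = Resources(cpu_cores=2.0, memory_gb=4.0)
print(forces_scale_out(request, free_now, fresh_node))
```

A request that is too small fits on the existing node and never triggers the autoscaler. A request that is too large cannot fit on any node of that SKU, so it fails for a different reason.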

Expected behavior
If the autoscaler is actively adding a node to support the requested workload, the deployment should wait longer than 5 minutes before timing out.

Screenshots
N/A

Additional context
This doc tells me it's due to resource constraints, which is true. But the real problem is that the deployment isn't waiting long enough for the AKS node pool to scale out, which (for us) takes just a bit longer than 5 minutes.

This could be addressed by making the timeout an optional setting for the deploy function, passed via the deployment_config parameter, with a default value of 5 minutes.
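
To make the proposal concrete, here is a minimal sketch of a scheduling wait whose timeout is configurable and defaults to 5 minutes. None of this is the SDK's actual implementation: wait_until_scheduled, SchedulingTimeout, and the is_scheduled callback are hypothetical names, and the clock and sleep parameters exist only so the behavior can be simulated without really waiting.

```python
import time
from typing import Callable


class SchedulingTimeout(Exception):
    """Raised when the workload was not scheduled before the timeout (hypothetical)."""


def wait_until_scheduled(is_scheduled: Callable[[], bool],
                         timeout_seconds: float = 300.0,  # today's fixed 5 minutes
                         poll_interval_seconds: float = 10.0,
                         clock: Callable[[], float] = time.monotonic,
                         sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll is_scheduled() until it returns True or timeout_seconds elapse.

    Returns the number of seconds waited. Raises SchedulingTimeout otherwise.
    """
    start = clock()
    while True:
        if is_scheduled():
            return clock() - start
        elapsed = clock() - start
        if elapsed >= timeout_seconds:
            raise SchedulingTimeout(
                f"not scheduled after {timeout_seconds:.0f} seconds")
        # Never sleep past the deadline.
        sleep(min(poll_interval_seconds, timeout_seconds - elapsed))
```

With the default, a node that becomes ready after six minutes still produces a timeout. A caller who knows their node pool is slow to scale could pass something like timeout_seconds=900 and let the deployment succeed.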

github-actions[bot] commented 1 year ago

Thank you for your feedback. This has been routed to the support team for assistance.

github-actions[bot] commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.