Closed strugdt closed 2 months ago
Is it waiting for the compute or environment to be created? Could you confirm that both have been created? The pipeline normally doesn't run for longer than 15 minutes.
It fails while the model is still training. I noticed that the pipeline in AML (prod_taxi_fare_run_11) kept running despite the error in ADO and completed about 15 minutes later, registering the model (taxi-model) in the registry. The compute (cpu-cluster) and environment (taxi-train-env) were deployed correctly, and everything else was in order when I validated it after the infra deployment pipeline.
Could we also consider changing the VM type for the compute? For Trial/Free and Visual Studio MCT subscriptions, a quota limit of 6 vCores applies in the selected region, 'eastus'. As a result, any subsequent provisioning, for example of compute for batch inferencing, will fail for everyone on a trial subscription. It would be nice to add a note before the endpoint deployment exercise saying that either the region or the VM type needs to be adjusted for it to succeed.
Code: InferencingClientCreateDeploymentFailed
Message: InferencingClient HttpRequest error, error detail: {"errors":{"VmSize":["Not enough quota available for Standard_DS3_v2 in SubscriptionId fe0c79f9-9d6e-43fc-8d3f-d9e79d40ea49. Current usage/limit: 0/6. Additional needed: 8 Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-outofquota"]}
Thanks a lot!
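The quota arithmetic behind the error above can be sketched as a small check. This is only an illustration: the helper names are hypothetical, and reading "Additional needed: 8" as two 4-vCPU Standard_DS3_v2 nodes is an assumption about how the service arrived at that number. The 0/6 usage and limit come from the error message.

```python
# vCPU counts for a few Dv2-series sizes. Confirm availability and
# quota for your own region and subscription before relying on these.
VCPUS_PER_SIZE = {
    "Standard_DS1_v2": 1,
    "Standard_DS2_v2": 2,
    "Standard_DS3_v2": 4,
}


def additional_cores_needed(vm_size: str, instance_count: int) -> int:
    """Cores a deployment would request (hypothetical helper)."""
    return VCPUS_PER_SIZE[vm_size] * instance_count


def fits_quota(vm_size: str, instance_count: int, used: int, limit: int) -> bool:
    """True when the request fits within the remaining core quota."""
    return used + additional_cores_needed(vm_size, instance_count) <= limit


# Assumed scenario from the error: usage 0, limit 6, 8 cores requested.
print(fits_quota("Standard_DS3_v2", 2, used=0, limit=6))
# A smaller size with the same node count would fit under the same limit.
print(fits_quota("Standard_DS2_v2", 2, used=0, limit=6))
```

Under these assumptions, either a smaller VM size or a region with a higher quota would let the deployment go through.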
Thank you for letting us know about the challenges of running this in a Trial / Free / MCT subscription. So if I understand correctly, the expectation is to add notes to the ADO guide about the quota limitation and the potential issue with the ADO agent. I can add notes to ensure that the correct VM size is selected and/or ask the user to pick the best available option under their subscription. Would that help? Or do you have other recommendations?
Yes, adding the notes before the step or activity would definitely help. I believe a more structured way could be -
Also, the other issue, with the pipeline agent, is independent of the subscription, I believe. So we could add a note before the model registration step telling users to ignore the ADO pipeline error: as long as the AML pipeline is successful and the model is visible in the registry, they are good to proceed. Hope this helps, and thanks for the prompt response.
Do you want to create a PR with the updates? You are the best person to help us write it up and provide the best guidance for others who might try this out. Thanks @strugdt
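As a rough sketch of the check described above, the rule "ignore the ADO error if the AML job completed and the model is registered" could look like this in Python. The function names are hypothetical, `fetch_state` assumes the `azure-ai-ml` and `azure-identity` packages are installed and that you are already signed in to Azure, and the job name you pass must be the job's actual name in your workspace.

```python
def safe_to_proceed(aml_job_status: str, registered_models: set, expected_model: str) -> bool:
    """True when the AML job completed and the expected model is registered."""
    return aml_job_status == "Completed" and expected_model in registered_models


def fetch_state(subscription_id: str, resource_group: str, workspace: str, job_name: str):
    """Read the job status and registered model names from the workspace."""
    # Imported here so the pure check above works without the SDK installed.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    client = MLClient(
        DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace,
    )
    status = client.jobs.get(job_name).status
    names = {model.name for model in client.models.list()}
    return status, names


# Offline example using the model name mentioned in this thread.
print(safe_to_proceed("Completed", {"taxi-model"}, "taxi-model"))
```

If the check returns False, the ADO error should not be ignored and the AML job itself needs a closer look.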
That makes sense. I will do it and tag once done. @setuc Thank you.
Added a note in the documentation and merged the PR https://github.com/Azure/mlops-v2/pull/125
Describe the bug or the issue that you are facing
Hi,
The model training pipeline "deploy-model-training-pipeline.yml" errors out after close to 75 minutes. I have, however, verified that all necessary artifacts are created without any issues, including the model registration in the registry. The problem appears to be that the ADO pipeline listener agent can't connect to the underlying compute for a prolonged time, although I have verified that the compute was up and running with no issues. Because of the error, log collation from AML to ADO is also obstructed.
I have reproduced this twice, using two different accounts in ADO/Azure and two different projects and subscriptions. Subscriptions used: Free Trial and Visual Studio for MCTs
Error Message :![image](https://github.com/Azure/mlops-v2/assets/16652407/5a5771a0-7432-4e6f-9a7c-b8550a2724a6)
More details : On both occasions it fails after about 70 minutes.![image](https://github.com/Azure/mlops-v2/assets/16652407/71fa6d6b-bb2b-4d9b-b72a-d1d59c106f21)
Total Training time in AML :![image](https://github.com/Azure/mlops-v2/assets/16652407/541146fc-1c86-498d-8116-c7cac96aa702)
Based on my analysis, this is a known issue in ADO, but no sufficient root cause analysis (RCA) is available. In that case, it would help to add a note to the deployment guide that lets learners know about it and encourages them to validate the AML artifacts and continue.
Steps/Code to Reproduce
I have reproduced this twice, using two different accounts in ADO/Azure and two different projects and subscriptions. Subscriptions used: Free Trial and Visual Studio for MCTs
Expected Output
Pipeline status to be reported successful.
Versions
ADO Azure ML CLI Classic ML
Which platform are you using for deploying your infrastructure?
Azure DevOps (ADO)
If you mentioned Others, please mention which platform you are using.
No response
What are you using for deploying your infrastructure?
Bicep
Are you using Azure ML CLI v2 or Azure ML Python SDK v2?
Azure ML CLI v2
Describe the example that you are trying to run?
Tabular.