Model Training ADO pipeline FAILS after over an hour

strugdt commented 4 months ago

Describe the bug or the issue that you are facing

Hi,

The model training pipeline "deploy-model-training-pipeline.yml" errors out after close to 75 minutes. However, I have verified that all necessary artifacts are created without any issues, including the model registration into the registry. The problem appears to be that the ADO pipeline listener agent can't connect to the underlying compute for a prolonged time, although I have verified that the compute was up and running with no issues. Because of the error, log collation from AML to ADO is also obstructed.

I have reproduced this twice, using two different accounts in ADO/Azure and two different projects and subscriptions. Subscriptions used: Free Trial and Visual Studio for MCTs.

Error Message :

More details : On both occasions it fails after about 70 minutes.

Total Training time in AML :

Based on my analysis, this is a known issue in ADO, but there isn't sufficient RCA available. In that case, it would help to add a note to the deployment guide letting learners know, and encouraging them to validate the AML artifacts and continue.

Steps/Code to Reproduce

I have reproduced this twice, using two different accounts in ADO/Azure and two different projects and subscriptions. Subscriptions used: Free Trial and Visual Studio for MCTs.

Expected Output

Pipeline status to be reported successful.

Versions

ADO Azure ML CLI Classic ML

Which platform are you using for deploying your infrastructure?

Azure DevOps (ADO)

If you mentioned Others, please mention which platform you are using.

No response

What are you using for deploying your infrastructure?

Bicep

Are you using Azure ML CLI v2 or Azure ML Python SDK v2?

Azure ML CLI v2

Describe the example that you are trying to run?

Tabular.

setuc commented 4 months ago

Is it waiting for the compute or environment to be created? Could you confirm that those have been created? The pipeline normally doesn't run for any longer than 15 mins.

strugdt commented 4 months ago

It fails while it is still training the model. I noticed that the pipeline in AML (prod_taxi_fare_run_11) continued running despite the error in ADO and completed about 15 minutes later, bringing the model (taxi-model) into the registry. The compute (cpu-cluster) and environment (taxi-train-env) have been deployed correctly, and everything else was in order, as validated after the infra deployment pipeline.

strugdt commented 4 months ago

Could we also consider changing the VM type for the compute? For Trial/Free/Visual Studio MCT subscriptions, a quota limit of 6 vCores applies in the selected region 'eastus'. As a result, all subsequent provisioning, for example of compute for batch inferencing, will fail for anyone using a trial subscription. It would be nice to put a note before the endpoint deployment exercise saying that either the region or the VM type needs to be adjusted to succeed.

Code: InferencingClientCreateDeploymentFailed
Message: InferencingClient HttpRequest error, error detail: {"errors":{"VmSize":["Not enough quota available for Standard_DS3_v2 in SubscriptionId fe0c79f9-9d6e-43fc-8d3f-d9e79d40ea49. Current usage/limit: 0/6. Additional needed: 8 Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-outofquota"]}

Thanks a lot!
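For anyone hitting the same error, here is a minimal sketch of the quota arithmetic behind that message. It only parses the "Current usage/limit" and "Additional needed" numbers from the error text; the function names `quota_shortfall` and `max_instances` are hypothetical helpers, not part of any Azure tooling, and the 4 vCPUs per Standard_DS3_v2 node is passed in as a parameter so it can be changed for another VM size.

```python
import re

# Matches the numeric part of the quota error text quoted above.
QUOTA_PATTERN = re.compile(
    r"Current usage/limit: (\d+)/(\d+)\. Additional needed: (\d+)"
)


def quota_shortfall(message):
    """Return how many cores are missing, or None if the text has no quota info."""
    match = QUOTA_PATTERN.search(message)
    if match is None:
        return None
    used, limit, needed = (int(group) for group in match.groups())
    return max(0, used + needed - limit)


def max_instances(limit, used, vcpus_per_node):
    """Return how many nodes of a given size fit into the remaining quota."""
    return max(0, (limit - used) // vcpus_per_node)


message = (
    "Not enough quota available for Standard_DS3_v2. "
    "Current usage/limit: 0/6. Additional needed: 8"
)

print(quota_shortfall(message))   # cores missing for this request
print(max_instances(6, 0, 4))     # Standard_DS3_v2 nodes (4 vCPUs each) that fit
```

With a limit of 6 cores, the request for 8 is short by 2, and only a single 4-vCPU node fits, so either a smaller instance count, a smaller VM size, or a different region is needed.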

setuc commented 4 months ago

Thank you for letting us know about the challenges of running this in a Trial / Free / MCT subscription. So, if I understand correctly, the expectation would be to add notes in the ADO guide about the limitation and the potential issue with the ADO agent. I can add the notes to ensure that the correct VM size is selected and/or ask the user to use the best available option under the subscription. Would that help? Or do you have other recommendations?

strugdt commented 4 months ago

Yes, adding the notes before the step or activity would definitely help. I believe a more structured way could be:

Update the Prerequisites here, adding sub-points under the first item to briefly mention that free/MCT subscriptions may pose some limitations and that users should pay close attention to the notes before deployment. This could also go on the GitHub deployment page.
Similar notes can also be added to the prerequisites in the MAIN Accelerator ReadMe.md here.
Then, add detailed notes at the beginning of each activity or step, before it begins.

Also, the other issue, about the pipeline agent, is independent of the subscription, I believe. So we could add a note before registering the model saying that the ADO pipeline error can be ignored: as long as the AML pipeline is successful and the model is seen in the registry, users are good to proceed. Hope this helps, and thanks for the prompt response.
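To make the "good to proceed" check concrete, here is a rough sketch. It assumes the job and model details have been saved as JSON, for example from `az ml job show --name <job-name> -o json` and `az ml model list --name taxi-model -o json`, and that a successful job reports the status "Completed". The function `safe_to_proceed` and the sample JSON are hypothetical and only illustrate the rule; they are not part of the accelerator.

```python
import json

# Status string assumed to indicate a successful Azure ML job.
SUCCESS_STATUS = "Completed"


def safe_to_proceed(job_json, models_json, model_name):
    """Return True when the AML job completed and the model is registered."""
    job = json.loads(job_json)
    models = json.loads(models_json)
    job_ok = job.get("status") == SUCCESS_STATUS
    model_ok = any(model.get("name") == model_name for model in models)
    return job_ok and model_ok


# Hypothetical sample data shaped like the CLI's JSON output.
job_json = '{"display_name": "prod_taxi_fare_run_11", "status": "Completed"}'
models_json = '[{"name": "taxi-model", "version": "1"}]'

print(safe_to_proceed(job_json, models_json, "taxi-model"))
```

If this returns True, the ADO pipeline error can be treated as a reporting problem on the agent side rather than a failed training run.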

setuc commented 4 months ago

Do you want to create a PR with the updates? You are the best person to help us write it out and provide the best guidance for others who might try it out. Thanks @strugdt

strugdt commented 4 months ago

That makes sense. I will do it and tag once done. @setuc Thank you.

setuc commented 2 months ago

Added a note in the documentation and merged the PR https://github.com/Azure/mlops-v2/pull/125
