Azure / azureml-examples

Official community-driven Azure Machine Learning examples, tested with GitHub Actions.
https://docs.microsoft.com/azure/machine-learning
MIT License
1.73k stars 1.41k forks

Cannot invoke a batch deployment #2614

Open Z30G0D opened 1 year ago

Z30G0D commented 1 year ago

Operating System

MacOS

Version Information

azure-cli 2.50.0 *

core 2.50.0
telemetry 1.0.8

Extensions:
ml 2.19.1

Dependencies:
msal 1.22.0
azure-mgmt-resource 23.1.0b2

Python (Darwin) 3.10.13 (main, Aug 24 2023, 22:48:59) [Clang 14.0.3 (clang-1403.0.22.14.1)]

Steps to reproduce

Invoking the batch deployment from your Azure ML MNIST example (the torch model) fails. The example is described here: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-batch-model-deployments?view=azureml-api-2&tabs=cli#about-this-example

JOB_NAME=$(az ml batch-endpoint invoke --name mnist-batch-dv --input https://azuremlexampledata.blob.core.windows.net/data/mnist/sample --input-type uri_folder --query name -o tsv)

Expected behavior

The job was supposed to run with the example data according to your tutorial.

Actual behavior

Getting the error:

Execution failed. User process '/azureml-envs/azureml_7594b3b934a904695f71542edf30f209/bin/python' exited with status code 42. Please check log file 'user_logs/std_log_0.txt' for error details. Error:
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 275, in <module>
    main()
  File "driver/amlbi_main.py", line 226, in main
    sys.exit(exitcode_candidate)
SystemExit: 42

Azure Machine Learning Batch Inference Start
[2023-09-02 09:59:18.451846] No started flag set. Skip creating started flag.
Azure Machine Learning Batch Inference End
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.14278292655944824 seconds
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 275, in <module>
    main()
  File "driver/amlbi_main.py", line 226, in main
    sys.exit(exitcode_candidate)
SystemExit: 42
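For anyone reproducing this, the error above points at a log file under user_logs, so the first step is usually to check the job status and pull the logs locally. The following is a minimal sketch, assuming that the default resource group and workspace are already configured for the CLI, and that the relevant logs are attached to the job whose name is passed in (for batch endpoints they may instead live on a child job). The function name `inspect_batch_job` and the `./job_logs` directory are illustrative only.

```shell
# Sketch: show the status of a batch scoring job and download its logs.
# Assumption: default resource group and workspace are set for the az CLI.
# inspect_batch_job and ./job_logs are hypothetical names used for illustration.
inspect_batch_job() {
  job_name="$1"
  dest="${2:-./job_logs}"

  # Print the job status only.
  az ml job show --name "$job_name" --query status -o tsv

  # Download all logs and outputs of the job into the destination folder.
  az ml job download --name "$job_name" --all --download-path "$dest"
}
```

With the JOB_NAME variable from the invoke command, this would be called as `inspect_batch_job "$JOB_NAME"`.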

Additional information

It worked ok until last week (22nd August per my last check).

I noticed that something is wrong with the input to driver/amlbi_main.py, a script I can't inspect because it's not open source.

It now receives the following argument: --model $AZUREML_DATAREFERENCE_score_model/../

But two weeks ago the argument was: --model_name *** --model_version 1
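The change in driver arguments can be made explicit by comparing the flag names in the two argument strings. This is only a sketch for spotting which flags appeared or disappeared between runs; the helper names `flag_names` and `flags_only_in` are hypothetical, and the argument strings are the ones quoted above.

```shell
# Print the flag names (tokens starting with --) of an argument string,
# one per line, sorted and de-duplicated.
flag_names() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep '^--' | sort -u
}

# Print the flags that occur in the first argument string but not in the second.
flags_only_in() {
  flag_names "$1" | while read -r flag; do
    flag_names "$2" | grep -qxF -e "$flag" || printf '%s\n' "$flag"
  done
}

old_args='--model_name *** --model_version 1'
new_args='--model $AZUREML_DATAREFERENCE_score_model/../'

echo "Flags only in the old invocation:"
flags_only_in "$old_args" "$new_args"
echo "Flags only in the new invocation:"
flags_only_in "$new_args" "$old_args"
```

The comparison matches whole flag names, so --model is not confused with --model_name.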

chengyuliu-msft commented 11 months ago

Hi,

I am seeing the same error, and so is our customer. In the customer's case, an MLflow model was deployed from the UI, a batch job was submitted from the UI, and then the same error occurred.

mlflow==2.7.0

In my case, the jobs were initially marked as failed because of errors in a customized score.py, even though the model deployed to the batch endpoint is an MLflow model. The user_logs folder contained two logs (std_logs_0.txt and std_logs_1.txt), and both showed the same error.

Azure Machine Learning Batch Inference Start
[2023-10-08 11:04:20.604801] No started flag set. Skip creating started flag.
Azure Machine Learning Batch Inference End
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.20161938667297363 seconds
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 275, in <module>
    main()
  File "driver/amlbi_main.py", line 226, in main
    sys.exit(exitcode_candidate)
SystemExit: 42

However, after I fixed the issue in score.py, the jobs completed. The user_logs folder still contained the same two logs (std_logs_0.txt and std_logs_1.txt), and one of them (std_logs_1.txt) showed a similar error with a different exit code.

[2023-10-08 13:30:58.311544] No started flag set. Skip creating started flag.
Azure Machine Learning Batch Inference End
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.1637425422668457 seconds
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 275, in <module>
    main()
  File "driver/amlbi_main.py", line 226, in main
    sys.exit(exitcode_candidate)
SystemExit: 41
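When several log files end in a SystemExit line with different codes, as in the 42 versus 41 case above, it helps to list the exit code per file. A minimal sketch, assuming the logs have already been downloaded to a local folder; the function name `scan_exit_codes` is hypothetical, and the meaning of the individual codes is not documented in this thread.

```shell
# Print "<file>:<exit code>" for every "SystemExit: N" occurrence found
# under the given directory. scan_exit_codes is a hypothetical helper name.
scan_exit_codes() {
  grep -rHo 'SystemExit: [0-9][0-9]*' "$1" | sed 's/SystemExit: //' | sort
}
```

Called as `scan_exit_codes ./user_logs`, this prints one line per match, which makes it easy to see which log reports which code.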

image

What is this amlbi_main.py script?

Z30G0D commented 11 months ago

It seems to have been fixed (my execution graph was different from yours). This is an image of the previous execution graph, which didn't work. Now it looks like what you posted here, and it works for me.

As far as I understand, amlbi_main.py is the driver script that consumes your entire batch deployment YAML file. It combines the data, environment, and other arguments you deployed into a single Python script that runs the job. It's not exposed to the public.

image