Open Z30G0D opened 1 year ago
Hi,
I am seeing the same error, and so is our customer. In the customer's case, an MLflow model was deployed from the UI, a batch job was submitted from the UI, and then the same error occurred.
mlflow==2.7.0
In my case, the jobs were initially marked as failed because there were errors in the customized score.py, even though the model deployed to the batch endpoint is an MLflow model. The user_logs folder had two logs (std_logs_0.txt and std_logs_1.txt). Both had the same error.
Azure Machine Learning Batch Inference Start
[2023-10-08 11:04:20.604801] No started flag set. Skip creating started flag.
Azure Machine Learning Batch Inference End
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.20161938667297363 seconds
Traceback (most recent call last):
File "driver/amlbi_main.py", line 275, in
However, after fixing the issue in score.py, the jobs completed. The user_logs folder still had the same two logs (std_logs_0.txt and std_logs_1.txt), but one of them (std_logs_1.txt) had a similar error with a different exit code.
[2023-10-08 13:30:58.311544] No started flag set. Skip creating started flag.
Azure Machine Learning Batch Inference End
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.1637425422668457 seconds
Traceback (most recent call last):
File "driver/amlbi_main.py", line 275, in
What is this amlbi_main.py script?
Seems like it was fixed (execution graph for me was different than yours). This is an image of the previous execution graph, and it didn't work. Now it looks like what you wrote here and it works for me.
As far as I understand, amlbi_main.py uses your entire batch YAML file. It combines the data, environment and other arguments that you deployed into one Python script that is then run.
It's not exposed to the public.
Operating System
macOS
Version Information
azure-cli 2.50.0 *
core 2.50.0 telemetry 1.0.8
Extensions: ml 2.19.1
Dependencies: msal 1.22.0 azure-mgmt-resource 23.1.0b2
Python (Darwin) 3.10.13 (main, Aug 24 2023, 22:48:59) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Steps to reproduce
Invoking the MNIST batch deployment example (the Torch model) from your Azure ML tutorial fails. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-batch-model-deployments?view=azureml-api-2&tabs=cli#about-this-example
JOB_NAME=$(az ml batch-endpoint invoke --name mnist-batch-dv --input https://azuremlexampledata.blob.core.windows.net/data/mnist/sample --input-type uri_folder --query name -o tsv)
Expected behavior
The job was supposed to run with the example data according to your tutorial.
Actual behavior
Getting the error:
Execution failed. User process '/azureml-envs/azureml_7594b3b934a904695f71542edf30f209/bin/python' exited with status code 42. Please check log file 'user_logs/std_log_0.txt' for error details. Error: Traceback (most recent call last):
File "driver/amlbi_main.py", line 275, in <module>
main()
File "driver/amlbi_main.py", line 226, in main
sys.exit(exitcode_candidate)
SystemExit: 42
Azure Machine Learning Batch Inference Start
[2023-09-02 09:59:18.451846] No started flag set. Skip creating started flag.
Azure Machine Learning Batch Inference End
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.14278292655944824 seconds
Traceback (most recent call last):
File "driver/amlbi_main.py", line 275, in <module>
main()
File "driver/amlbi_main.py", line 226, in main
sys.exit(exitcode_candidate)
SystemExit: 42
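For anyone comparing several log files, here is a minimal sketch that pulls the exit codes out of downloaded logs by looking for the "SystemExit: N" line that ends the traceback above. The helper names (find_exit_codes, scan_user_logs) and the local "user_logs" folder are my own assumptions for illustration, not part of Azure ML; it only reads text files that you have already downloaded.

```python
import re
from pathlib import Path

# Matches the last line of the traceback shown above, e.g. "SystemExit: 42".
SYSTEM_EXIT = re.compile(r"SystemExit:\s*(-?\d+)")


def find_exit_codes(text):
    """Return every SystemExit code found in a log text, in order of appearance."""
    return [int(code) for code in SYSTEM_EXIT.findall(text)]


def scan_user_logs(folder):
    """Map each .txt file directly under `folder` to the exit codes it contains.

    Files without any SystemExit line are left out of the result.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.txt")):
        codes = find_exit_codes(path.read_text(errors="replace"))
        if codes:
            results[path.name] = codes
    return results


if __name__ == "__main__":
    # "user_logs" is an assumed local download location; adjust as needed.
    for name, codes in scan_user_logs("user_logs").items():
        print(name, codes)
```

This makes it easier to see which of the std_log files failed and whether the exit code changed between runs.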
Additional information
It worked ok until last week (22nd August per my last check).
I noticed that something is wrong with the input to driver/amlbi_main.py, which I can't inspect since it's not open source. It is now getting the following argument:
--model $AZUREML_DATAREFERENCE_score_model/../
But two weeks ago the argument was:
--model_name *** --model_version 1