Azure / azureml-examples

Official community-driven Azure Machine Learning examples, tested with GitHub Actions.
https://docs.microsoft.com/azure/machine-learning
MIT License

Cannot invoke a batch deployment #3343

Closed CB-LTD closed 3 months ago

CB-LTD commented 3 months ago

Operating System

Windows

Version Information

I am following the guide on deploying MLflow models to batch endpoints, running from AzureML notebooks with the Python 3.10 - SDK v2 kernel.

Running job = ml_client.batch_endpoints.invoke(endpoint_name=endpoint.name, input=job_input) appears to invoke the endpoint, but the job fails with the error:

/azureml-envs/azureml_23ebd2c95b68f6326bb643bbad5d9607/lib/python3.8/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/azureml-envs/azureml_23ebd2c95b68f6326bb643bbad5d9607/lib/python3.8/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
Azure Machine Learning Batch Inference Start
[2024-08-07 15:04:33.432774] No started flag set. Skip creating started flag.
Azure Machine Learning Batch Inference End
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.13996434211730957 seconds
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 275, in <module>
    main()
  File "driver/amlbi_main.py", line 226, in main
    sys.exit(exitcode_candidate)
SystemExit: 42
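Most of the output above is deprecation noise from paramiko; the only real signal is the final SystemExit: 42, which by itself does not say what went wrong. As a rough sketch (the helper names and the shortened sample log are illustrative, not part of any Azure ML tooling), the driver log can be reduced to its exit code and non-warning lines like this:

```python
import re

# Substrings that mark the paramiko/cryptography deprecation noise in this log.
NOISE_MARKERS = ("CryptographyDeprecationWarning", "algorithms.TripleDES")


def extract_exit_code(log_text):
    """Return the integer after 'SystemExit:' or None if it is absent."""
    match = re.search(r"SystemExit:\s*(-?\d+)", log_text)
    return int(match.group(1)) if match else None


def meaningful_lines(log_text):
    """Drop blank lines and lines that only carry deprecation noise."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.strip() and not any(marker in line for marker in NOISE_MARKERS)
    ]


# Shortened sample based on the log in this issue (paths trimmed for brevity).
SAMPLE_LOG = "\n".join(
    [
        "paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved",
        '"cipher": algorithms.TripleDES,',
        "Azure Machine Learning Batch Inference Start",
        "Azure Machine Learning Batch Inference End",
        "Traceback (most recent call last):",
        "SystemExit: 42",
    ]
)

if __name__ == "__main__":
    print("exit code:", extract_exit_code(SAMPLE_LOG))
    for line in meaningful_lines(SAMPLE_LOG):
        print(line)
```

Once the warnings are filtered out, nothing in this log identifies the failing step, so the root cause has to come from other log files.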

From what I can gather, this is suspected to be an issue with the scoring script; however, the scoring script is now generated automatically when an MLflow model is deployed to a batch endpoint. To get at my scoring script I need to unpickle my model, which requires CUDA, so I'm waiting for GPU quota to be approved before I can investigate that route further.
Similar (but slightly different) issues have been raised here and here.

The only deviations from the guide are that I uploaded and registered the model through the GUI, which the guide suggests is acceptable, and that I uploaded the dataset to ADLS Gen2 storage and read it from there.

Steps to reproduce

  1. Create an AzureML notebook using the Python 3.10 - SDK v2 kernel
  2. Clone the AzureML examples repo
  3. Upload & register the heart-classifier-mlflow model
  4. Upload the heart-dataset-unlabeled to ADLS Gen2 storage
  5. Follow the Python instructions from 'connect to your workspace' on this page.
  6. Invoke the batch endpoint using job = ml_client.batch_endpoints.invoke(endpoint_name=endpoint.name, input=job_input); the job starts and fails shortly afterwards
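For reference, steps 4 to 6 can be sketched roughly as below. This is a minimal sketch under assumptions, not the exact code from the guide: the storage account, file system and path in the usage comment are placeholders, and it assumes the ADLS Gen2 data is referenced directly by an abfss:// URI and that the workspace identity has read access to it. The azure-ai-ml imports are deferred so the URI helper can be used without the SDK installed.

```python
def adls_gen2_uri(account_name, file_system, path):
    """Build an abfss:// URI for a folder or file in ADLS Gen2 storage."""
    clean_path = path.strip("/")
    return f"abfss://{file_system}@{account_name}.dfs.core.windows.net/{clean_path}"


def invoke_with_adls_input(ml_client, endpoint_name, uri):
    """Invoke a batch endpoint with a folder input that points at `uri`.

    `ml_client` is expected to be an authenticated azure.ai.ml.MLClient.
    """
    # Imported lazily: requires the azure-ai-ml package (SDK v2).
    from azure.ai.ml import Input
    from azure.ai.ml.constants import AssetTypes

    job_input = Input(type=AssetTypes.URI_FOLDER, path=uri)
    return ml_client.batch_endpoints.invoke(
        endpoint_name=endpoint_name,
        input=job_input,
    )


# Hypothetical usage; every name here is a placeholder:
# uri = adls_gen2_uri("mystorageaccount", "datasets", "heart-dataset-unlabeled")
# job = invoke_with_adls_input(ml_client, endpoint.name, uri)
```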

Expected behavior

The test data would be processed by the model on the endpoint, scored & written.

Actual behavior

The endpoint errored, as highlighted in the main body of this issue.

Additional information

The problem appears to lie in the scoring script; however, the scoring script is auto-generated and, as far as I can tell, not viewable.

CB-LTD commented 3 months ago

As per LIK2RNG's comment here, you get a more useful error message by going into Logs/joberror_xxx.txt and Logs/jobresult.txt
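If the job's logs have been downloaded to a local folder, a small helper can collect those two kinds of files in one go. The sketch below assumes the logs were fetched beforehand (for example with ml_client.jobs.download(name=job.name, download_path="./job_logs", all=True)); the folder name and the helper name are illustrative, and the exact directory layout of the download may differ, so the search is recursive.

```python
from pathlib import Path

# File name patterns taken from the comment above.
ERROR_LOG_PATTERNS = ("joberror_*.txt", "jobresult.txt")


def collect_error_logs(download_root):
    """Return {relative path: file contents} for every error log under the root."""
    root = Path(download_root)
    found = {}
    for pattern in ERROR_LOG_PATTERNS:
        # rglob searches all subfolders, so the casing of 'Logs' vs 'logs' does not matter.
        for path in sorted(root.rglob(pattern)):
            relative = path.relative_to(root).as_posix()
            found[relative] = path.read_text(encoding="utf-8", errors="replace")
    return found


if __name__ == "__main__":
    # "./job_logs" is a placeholder for wherever the job output was downloaded.
    for name, text in collect_error_logs("./job_logs").items():
        print(f"===== {name} =====")
        print(text)
```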