Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Azure ML - Retrieve an AutoMLStep model and use for inference #1838

Closed nabil-fcbqa closed 1 year ago

nabil-fcbqa commented 1 year ago

Hello everyone,

I am currently trying to use the AutoMLStep to train a machine learning model, register it in the workspace, and use it for inference as a deserialized model. My current project folder/file structure is the following:

project/
│
├── src/
│   ├── data_prep.py
│   └── register_model.py
└── pipeline.py

I am mostly basing my work on https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-use-automlstep-in-pipelines. In pipeline.py, I create the pipeline's PythonScriptStep objects (two in this case) as well as the AutoMLStep, which is defined as follows:

train_step = AutoMLStep(name='AutoML_Classification',
    automl_config=automl_config,
    passthru_automl_config=False,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

Where:

metrics_data = PipelineData(name='metrics_data',
                           datastore=blobstore,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=blobstore,
                           pipeline_output_name='best_model_ticketing',
                           training_output=TrainingOutput(type='Model'))

For the register_model.py script, which is the last step in my pipeline sequence, I want to register the model, and use it to make predictions. I've tried the following:

from azureml.core.model import Model, Dataset
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
import argparse
import os
import pickle

from azureml.pipeline.core import PipelineRun
from azureml.pipeline.steps.automl_step import AutoMLStepRun

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)
parser.add_argument("--model_path", required=True)
args = parser.parse_args()

run = Run.get_context()
ws = Workspace.from_config() if type(run) == _OfflineRun else run.experiment.workspace

pipeline_run_id = run.parent.id
pipeline_run = PipelineRun(experiment=run.experiment, run_id=pipeline_run_id)  # This is the Pipeline run, that orchestrates the overall pipeline
best_model_output = pipeline_run.get_pipeline_output('best_model_ticketing')
num_file_downloaded = best_model_output.download('.', show_progress=True)

model_filename = best_model_output._path_on_datastore
with open(model_filename, "rb" ) as f:
    best_model = pickle.load(f)

file_name = f"../outputs/model/{args.model_name}.pkl"
os.makedirs(os.path.dirname(file_name), exist_ok=True)
# pickle.dump takes the object and an open binary file, not a filename
with open(file_name, "wb") as f:
    pickle.dump(best_model, f)
print("Pickling of model complete")

# Register model in AzureML
model = Model.register(model_path = file_name,
                       model_name = args.model_name,
                       description = "Model, with Hyperparameters Tuned",
                       workspace = ws)

Which leads to

Traceback (most recent call last):
  File "src/register_model.py", line 26, in <module>
    best_model = pickle.load(f)
EOFError: Ran out of input
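
For context, `EOFError: Ran out of input` is what `pickle.load` raises when the file it reads contains no bytes, so the file at `model_filename` may be empty at the moment it is opened. The sketch below reproduces that with the standard library only and shows a hypothetical helper, `load_pickle_checked` (my own name, not part of the Azure ML SDK), that checks the file's existence and size before unpickling. It does not explain why the file would be empty in the pipeline; it only helps confirm whether that is the case.

```python
import os
import pickle
import tempfile


def load_pickle_checked(path):
    """Unpickle `path`, failing early with a clear message if it is missing or empty.

    Hypothetical diagnostic helper; not part of the Azure ML SDK.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No file at {path!r}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path!r} is empty (0 bytes); nothing to unpickle")
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    # Create a zero-byte file to stand in for an empty model file.
    empty_path = os.path.join(tmp, "empty.pkl")
    open(empty_path, "wb").close()

    # Plain pickle.load on an empty file raises EOFError.
    try:
        with open(empty_path, "rb") as f:
            pickle.load(f)
        raw_error = None
    except EOFError as exc:
        raw_error = str(exc)

    # The checked loader reports the underlying cause instead.
    try:
        load_pickle_checked(empty_path)
        checked_error = None
    except ValueError as exc:
        checked_error = str(exc)

    # A non-empty pickle file loads normally.
    good_path = os.path.join(tmp, "good.pkl")
    with open(good_path, "wb") as f:
        pickle.dump({"answer": 42}, f)
    loaded = load_pickle_checked(good_path)
```

Calling such a helper in place of the bare `pickle.load` in register_model.py would show whether the path exists and whether anything was actually written to it.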

Ideally, to integrate this with my current project script, I'd like to use a similar approach to this:

# Begin pickling the model
# non AutoML training done prior to this to create best_xgb_model in same script
print("Begin pickling the model")
model_name = args.registered_model_name

# save model in ./model
print("Exporting model as a .pkl")

import os
import joblib

file_name = f"../outputs/model/{model_name}.pkl"
os.makedirs(os.path.dirname(file_name), exist_ok=True)
joblib.dump(value = best_xgb_model, filename = file_name)
print("Pickling of model complete")

# Register model in AzureML
print("Registering Model with AzureML")
model = Model.register(
    model_path = file_name,
    model_name = model_name,
    description = "Model, with Hyperparameters Tuned",
    workspace = ws
)

Which allows the model to be used this way:

model_path = Model.get_model_path(model_name = args.registered_model_name, _workspace=ws) # get path of *latest* model
# Deserialize the model file back into xgb model
best_xgb_model = joblib.load(model_path)
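
One difference matters when swapping joblib for pickle in this pattern: `joblib.dump` takes a filename, while `pickle.dump` takes an open binary file object and rejects the `value=`/`filename=` keywords. Below is a minimal standard-library sketch of the dump/load round trip. The placeholder dictionary standing in for a fitted model and the file name `my_model.pkl` are illustrative assumptions, and the `Model.register` call is left out because it needs a workspace.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object works for this sketch.
best_model = {"kind": "placeholder-model", "params": {"max_depth": 3}}

with tempfile.TemporaryDirectory() as tmp:
    file_name = os.path.join(tmp, "outputs", "model", "my_model.pkl")
    os.makedirs(os.path.dirname(file_name), exist_ok=True)

    # joblib-style keywords are not accepted by pickle.dump.
    try:
        pickle.dump(value=best_model, filename=file_name)
        keyword_call_failed = False
    except TypeError:
        keyword_call_failed = True

    # pickle.dump expects the object and an open binary file handle.
    with open(file_name, "wb") as f:
        pickle.dump(best_model, f)

    # Loading plays the role of joblib.load(model_path), again via a file handle.
    with open(file_name, "rb") as f:
        restored_model = pickle.load(f)
```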

The bottom line is: how can I retrieve the AutoMLStep's best fitted model in the following step (register_model.py) in such a way that I can joblib.dump it, register it, and load it for predictions? I've tried registering the model directly, but that doesn't save the model as a .pkl file, and I wasn't able to use it for inference via get_model_path.

Help would be greatly appreciated.