Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

SDK v2 AutoML experiment: get model and metrics for any algorithm (not only for the best one) #28904

Open movingabout opened 1 year ago

movingabout commented 1 year ago

What I'm trying to do

For an AutoML Forecasting experiment, I'd like to compare the performance of the best model with the performance of another model from the same experiment.

For an AutoML run, I understand how to get the best performing model and its metrics like this:

# ...initialize MLFlow client...
mlflow_parent_run = mlflow_client.get_run('upbeat_square_abs3942')
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
best_run = mlflow_client.get_run(best_child_run_id)
best_run.data.metrics
# etc...

But how can I fetch the job for any model based on the algorithm name? Something like:

# pseudocode:
mlflow_client.get_automl_run_for_algorithm('XGBoostRegressor')

So far, I managed to figure out the following:

  1. list of algorithms used in the AutoML experiment

    mlflow_parent_run.data.tags['pipeline_id_000']
    #  '__AutoML_Naive__;__AutoML_SeasonalNaive__;__AutoML_Average__;__AutoML_SeasonalAverage__;__AutoML_Ensemble__'

    However, this list seems to be in an arbitrary order and I struggle to get the corresponding job names for the algorithms.

  2. "internal" job names for the child runs

The child runs seem to have different names than those shown in Azure ML Studio. They are named, for instance, upbeat_square_abs3942_2, i.e. the name of the parent run upbeat_square_abs3942 followed by an underscore plus a number (_2 in this example).

But Azure ML Studio displays names that look quite different (no upbeat_square_abs3942_2 to be found):

[Screenshot: child run display names in Azure ML Studio, e.g. green_floor_0ln3tlpv]

So this code works:

child_run = mlflow_client.get_run('upbeat_square_abs3942_2')

but using a name shown in the screenshot above throws an exception, e.g.

child_run = mlflow_client.get_run('green_floor_0ln3tlpv')
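The closest workaround I have so far is to enumerate the child runs by index, based on the naming scheme above, and inspect each one. A minimal, purely exploratory sketch (assuming the indices are contiguous and that get_run() raises MlflowException for a run that does not exist):

from mlflow.exceptions import MlflowException

# Exploratory sketch, relying on the <parent-run-name>_<n> naming scheme
# observed above; not an official API for listing AutoML child runs.
parent_run_id = 'upbeat_square_abs3942'
index = 0
while True:
    try:
        child_run = mlflow_client.get_run(f'{parent_run_id}_{index}')
    except MlflowException:
        break  # assume no more child runs follow this naming scheme
    print(child_run.info.run_id, child_run.data.metrics)
    index += 1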

Question

How can I obtain the model and metrics for any algorithm used in the experiment?

Thanks!

ghost commented 1 year ago

Thank you for your feedback. This has been routed to the support team for assistance.

navba-MSFT commented 1 year ago

@movingabout Thanks for reaching out to us and reporting this issue. AFAIK the MLflow API does not provide a direct method to get the AutoML run for a specific algorithm. However, you can still achieve this by iterating through all the child runs of the AutoML parent run and checking the algorithm name of each run. Here's some pseudocode to get you started:

# ...initialize MLflow client...
parent_run_id = 'upbeat_square_abs3942'
target_algorithm = 'XGBoostRegressor'

parent_run = mlflow_client.get_run(parent_run_id)
child_runs = parent_run.get_children()

for child_run in child_runs:
    if child_run.data.params['algorithm_name'] == target_algorithm:
        # do something with the matching run

This code fetches the parent run and its children, then iterates through each child run and checks if its algorithm_name parameter matches the target algorithm you're interested in. If it does, you can perform any additional actions with that run that you need. Hope this helps.

movingabout commented 1 year ago

@navba-MSFT, thanks for the reply!

However, the parent run object does not provide a get_children() method. And when I fetch the child job directly using the naming scheme described above (e.g. upbeat_square_abs3942_2), the data.params dictionary is empty, i.e. does not contain the 'algorithm_name' key you mentioned.

Since the Python SDK v2 is quite new, does this maybe have to do with the SDK version? I'm currently using the following:

azure-ai-ml==1.3.0
azure-common==1.1.28
azure-core==1.26.2
azure-identity==1.12.0
azure-mgmt-core==1.3.2
azure-storage-blob==12.13.0
azure-storage-file-datalake==12.8.0
azure-storage-file-share==12.10.1
azureml-mlflow==1.48.0

Thanks!

navba-MSFT commented 1 year ago

@movingabout Thanks for getting back. My bad. You are correct that the get_children() method is not available on the Run object in MLflow 1.x. The get_children() method is only available starting from MLflow 2.0.

To fetch the child runs in MLflow 1.x, you can use the search_runs() method and specify the parent run ID as a filter. Here's some sample code:

# ...initialize MLflow client...
parent_run_id = 'upbeat_square_abs3942'
target_algorithm = 'XGBoostRegressor'

# the experiment ID can be read off the parent run itself
experiment_id = mlflow_client.get_run(parent_run_id).info.experiment_id

child_runs = mlflow_client.search_runs(
    experiment_ids=[experiment_id],
    filter_string=f"tags.mlflow.parentRunId = '{parent_run_id}'"
)

for child_run in child_runs:
    if child_run.data.params.get('algorithm_name') == target_algorithm:
        # do something with the matching run

This code uses the search_runs() method to search for all runs that have the parentRunId tag equal to the ID of the parent run. Then, for each child run, it checks if the algorithm_name parameter matches the target algorithm.

Regarding the missing algorithm_name parameter, it is possible that the AutoML experiment was run with an older version of the SDK that did not log this parameter. In that case, you may need to use a different method to identify the model you're interested in. One possibility is to look for the model with the highest performance metric among all models with the same algorithm. You can do this by filtering the child runs by algorithm and sorting them by metric value.
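As a rough sketch of the ranking step (the metric key below, normalized_root_mean_squared_error, is only an example of a common AutoML forecasting metric; substitute whatever metric your run actually logs):

# Rank the child runs found above by a chosen metric (lower is better
# for an error metric). The metric name is an example and may differ
# in your experiment.
metric_name = 'normalized_root_mean_squared_error'

scored_runs = [run for run in child_runs if metric_name in run.data.metrics]
scored_runs.sort(key=lambda run: run.data.metrics[metric_name])

for run in scored_runs:
    print(run.info.run_id, run.data.metrics[metric_name])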

Here's how you can use the get_children() method to fetch child runs and their algorithm_name parameter in MLflow 2.0:

# ...initialize MLflow client...
parent_run_id = 'upbeat_square_abs3942'
target_algorithm = 'XGBoostRegressor'

parent_run = mlflow_client.get_run(parent_run_id)
child_runs = parent_run.get_children()

for child_run in child_runs:
    child_run_data = child_run.to_dictionary()
    if 'algorithm_name' in child_run_data['params'] and child_run_data['params']['algorithm_name'] == target_algorithm:
        # do something with the matching run

This code fetches the parent run and its children using the get_children() method. For each child run, it checks if the algorithm_name parameter is present in the run's parameters and if it matches the target algorithm you're interested in. If it does, you can perform any additional actions with that run that you need.

Note that in this code, we first call to_dictionary() on the child run object to convert it to a dictionary format before checking for the algorithm_name parameter. This is because the child run object returned by get_children() in MLflow 2.0 is a lightweight object that does not contain all the run data by default. Calling to_dictionary() forces the full run data to be fetched and returned in a dictionary format that includes all the run parameters.

nthandeMS commented 1 year ago

@skasturi Can you look into this? If it's possible, it seems like it would be nice to have this documented as well.

movingabout commented 1 year ago

Hi @navba-MSFT,

Thanks, search_runs() works well!

However, child_run.data.params is an empty dict. Since you mentioned this is due to the SDK version: I'm currently using azure-ai-ml==1.3.0, will version 1.4.0 (latest from Feb 07) populate child_run.data.params with the 'algorithm_name' key?

Thanks!

navba-MSFT commented 1 year ago

@movingabout Thanks for your reply. I apologize for the confusion. I believe the issue you're encountering with the data.params dictionary being empty is not related to the Azure ML SDK version; I suspect it is a limitation of the MLflow API in general.

AFAIK, in AutoML runs the algorithm_name parameter is set by the AutoML framework and not by the model training code, so it is not always present in the params dictionary of the child runs; in some cases it appears not to be logged as a parameter at all, even when the framework sets it.

One workaround for this limitation is to instead use the run_name or experiment_id properties of the child runs to infer the algorithm name. For example, if you are using the naming convention upbeat_square_abs3942_XGBoostRegressor_2 for your child runs, you could extract the algorithm name from the run name using string manipulation:

for child_run in child_runs:
    if 'XGBoostRegressor' in child_run.info.run_name:
        # do something with the matching run

Alternatively, if you're using the MLflow UI to view your runs, you can filter the runs by algorithm name using the "tag" feature in the UI. When you start an AutoML run in MLflow, the framework automatically logs a "mlflow.autologging" tag to the parent run with the value "automl". You can filter the child runs by this tag in the MLflow UI, and then manually inspect the run names or other properties to identify the runs corresponding to a specific algorithm.
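If that tag is indeed logged in your workspace, the programmatic equivalent would be a filter_string on it. A sketch, reusing the experiment_id from the earlier snippet and assuming the tag key and value really are the ones described above:

# Sketch only: assumes the parent run carries the autologging tag
# described above; the exact tag key/value may differ between SDK and
# MLflow versions, so inspect run.data.tags first if this returns nothing.
automl_parent_runs = mlflow_client.search_runs(
    experiment_ids=[experiment_id],
    filter_string="tags.mlflow.autologging = 'automl'"
)
for run in automl_parent_runs:
    print(run.info.run_id, run.data.tags.get('mlflow.runName'))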

movingabout commented 1 year ago

Hi @navba-MSFT,

As far as I understand it, I cannot influence the naming (or tagging) of the child runs of an AutoML job. Or am I wrong here?

For the run mentioned above, I see the following jobs in Azure ML Studio (tab "Child jobs"):

[Screenshot: list of child jobs in Azure ML Studio]

Some of them seem to be jobs that train actual models, while others look more like "meta-jobs" that set up the AutoML run. I am not aware of any way to specify the job names when defining and submitting the automl.forecasting job.

And I see a list of models (tab "Models"):

[Screenshot: list of trained models in the "Models" tab]

The easiest way, I presume, would be to somehow access the underlying data that is shown in the "Models" tab, because clicking on any given model entry (say "SeasonalNaive") in the list shown in the screenshot leads right to the corresponding child job. Then I'd know the algorithm and could access all the metrics, artifacts, etc.
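For now, the only programmatic approach I can think of is to dump everything each child run exposes and look for a field that carries the algorithm name. A purely exploratory sketch, reusing the child_runs list from the search_runs() call above (it makes no assumption about any specific tag or parameter key):

# Exploratory sketch: print whatever each child run exposes, to spot
# where (if anywhere) the algorithm name shows up. Tag/param keys vary
# between SDK versions, so no specific key is assumed here.
for child_run in child_runs:
    print(child_run.info.run_id)
    print('  name:   ', child_run.data.tags.get('mlflow.runName'))
    print('  tags:   ', child_run.data.tags)
    print('  params: ', child_run.data.params)
    print('  metrics:', child_run.data.metrics)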

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github, @Azure/azure-ml-sdk.
