Azure / azure-sdk-for-python


[azureml-mlflow] Mlflow fails to read model from Azure ML registry when inside component in Pipeline Job #32353

Open hugobettmach opened 11 months ago

hugobettmach commented 11 months ago

Describe the bug
I'm trying to load a model from an Azure ML registry using MLflow.

To Reproduce
Steps to reproduce the behavior:

  1. The following code works for me locally but fails when run inside a component in a Pipeline Job:
import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

registry_name = "test-registry"
model_name = "test"
model_version = 1

credential = DefaultAzureCredential()
ml_client_registry = MLClient(credential=credential, registry_name=registry_name)

mlflow_tracking_uri = ml_client_registry.registries.get(registry_name).mlflow_registry_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)

model = mlflow.tensorflow.load_model(model_uri=f"models:/{model_name}/{model_version}")
  2. Here is the error traceback I get.
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/0a69f98038724bdebdbdf489b4f8be74/exe/wd/component.py", line 100, in <module>
    asset_io_component(
  File "/mnt/azureml/cr/j/0a69f98038724bdebdbdf489b4f8be74/exe/wd/component.py", line 53, in asset_io_component
    sktransformer = mlflow.sklearn.load_model(
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/sklearn/__init__.py", line 610, in load_model
    local_model_path = _download_artifact_from_uri(artifact_uri=model_uri, output_path=dst_path)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/tracking/artifact_utils.py", line 100, in _download_artifact_from_uri
    return get_artifact_repository(artifact_uri=root_uri).download_artifacts(
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/store/artifact/artifact_repository_registry.py", line 115, in get_artifact_repository
    return _artifact_repository_registry.get_artifact_repository(artifact_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/store/artifact/artifact_repository_registry.py", line 72, in get_artifact_repository
    return repository(artifact_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/store/artifact/models_artifact_repo.py", line 44, in __init__
    uri = ModelsArtifactRepository.get_underlying_uri(artifact_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/store/artifact/models_artifact_repo.py", line 77, in get_underlying_uri
    client = MlflowClient(registry_uri=databricks_profile_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/tracking/client.py", line 81, in __init__
    self._tracking_client = TrackingServiceClient(final_tracking_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 51, in __init__
    self.store
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 55, in store
    return utils._get_store(self.tracking_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/utils.py", line 214, in _get_store
    return _tracking_store_registry.get_store(store_uri, artifact_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/registry.py", line 39, in get_store
    return self._get_store_with_resolved_uri(resolved_store_uri, artifact_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/registry.py", line 49, in _get_store_with_resolved_uri
    return builder(store_uri=resolved_store_uri, artifact_uri=artifact_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/azureml/mlflow/entry_point_loaders.py", line 37, in azureml_store_builder
    service_context = _AzureMLServiceContextLoader.load_service_context(store_uri)
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/azureml/mlflow/_internal/service_context_loader.py", line 98, in load_service_context
    get_service_context_from_registry_url_mlflow_env_vars(
  File "/opt/miniconda/envs/aion-cpu-base/lib/python3.10/site-packages/azureml/mlflow/_internal/utils.py", line 608, in get_service_context_from_registry_url_mlflow_env_vars
    auth = AzureMLTokenAuthentication.create(
TypeError: AzureMLTokenAuthentication.create() got an unexpected keyword argument 'registry_name'

Expected behavior
I would expect this to work both locally and in the component. Locally I get the model object as expected.

Additional context

The client authenticates correctly inside the component, since I have granted the user-assigned managed identity attached to the cluster access to the Azure ML registry. I can run operations with MLClient to register or read assets from the registry.
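
For reference, a minimal sketch of the kind of registry read that does succeed inside the component (the registry, model name, and version are the assumed values from the repro above):

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# The managed identity attached to the cluster has access to the registry,
# so registry-scoped asset operations like this work.
ml_client_registry = MLClient(credential=DefaultAzureCredential(), registry_name="test-registry")
model_asset = ml_client_registry.models.get(name="test", version="1")
print(model_asset.id)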

I seem to have found what the problem is by looking at the code inside azureml-mlflow:

  1. In the file azureml.mlflow._internal.utils line 608 there is a call to this method, and registry_name is passed as an arg:
    auth = AzureMLTokenAuthentication.create(
        azureml_access_token=token,
        expiry_time=None,
        host=host_url,
        subscription_id=subscription_id,
        resource_group_name=resource_group_name,
        registry_name=registry_name,
        workspace_name=None,
        experiment_name=experiment_name,
        experiment_id=experiment_id,
        run_id=run_id,
    )
  2. But looking at the method definition in azureml.mlflow._common._authentication.azureml_token_authentication, line 169, registry_name is not in the list of arguments, and I don't see any logic in the code that handles it:
    @classmethod
    def create(cls, azureml_access_token, expiry_time, host, subscription_id,
               resource_group_name, workspace_name, experiment_name, run_id, user_email=None, experiment_id=None):
github-actions[bot] commented 11 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.

singankit commented 10 months ago

Thanks @hugobettmach. If you set the registry URI with mlflow.set_registry_uri instead of setting the tracking URI, this would work.

Why this would work: once you set the registry URI, it tells MLflow that you are looking for the model in an AML registry and not in an AML workspace.
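
Applied to the repro above, the suggestion would look roughly like this sketch (same assumed registry, model name, and version as before):

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

registry_name = "test-registry"
model_name = "test"
model_version = 1

credential = DefaultAzureCredential()
ml_client_registry = MLClient(credential=credential, registry_name=registry_name)

# Point MLflow's model registry (rather than its tracking store) at the AML registry.
registry_uri = ml_client_registry.registries.get(registry_name).mlflow_registry_uri
mlflow.set_registry_uri(registry_uri)

model = mlflow.tensorflow.load_model(model_uri=f"models:/{model_name}/{model_version}")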

hugobettmach commented 10 months ago

Thanks. This indeed avoids the other error, but now it seems like it can't find the model:

---------------------------------------------------------------------------
MlflowException                           Traceback (most recent call last)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tensorflow/__init__.py:603, in load_model(model_uri, dst_path, saved_model_kwargs, keras_model_kwargs)
    555 """
    556 Load an MLflow model that contains the TensorFlow flavor from the specified path.
    557 
   (...)
    599         ]
    600 """
    601 import tensorflow
--> 603 local_model_path = _download_artifact_from_uri(artifact_uri=model_uri, output_path=dst_path)
    605 model_configuration_path = os.path.join(local_model_path, MLMODEL_FILE_NAME)
    606 model_conf = Model.load(model_configuration_path)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tracking/artifact_utils.py:100, in _download_artifact_from_uri(artifact_uri, output_path)
     94 """
     95 :param artifact_uri: The *absolute* URI of the artifact to download.
     96 :param output_path: The local filesystem path to which to download the artifact. If unspecified,
     97                     a local output path will be created.
     98 """
     99 root_uri, artifact_path = _get_root_uri_and_artifact_path(artifact_uri)
--> 100 return get_artifact_repository(artifact_uri=root_uri).download_artifacts(
    101     artifact_path=artifact_path, dst_path=output_path
    102 )

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/artifact_repository_registry.py:115, in get_artifact_repository(artifact_uri)
    105 def get_artifact_repository(artifact_uri):
    106     """Get an artifact repository from the registry based on the scheme of artifact_uri
    107 
    108     :param artifact_uri: The artifact store URI. This URI is used to select which artifact
   (...)
    113              requirements.
    114     """
--> 115     return _artifact_repository_registry.get_artifact_repository(artifact_uri)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/artifact_repository_registry.py:72, in ArtifactRepositoryRegistry.get_artifact_repository(self, artifact_uri)
     67 if repository is None:
     68     raise MlflowException(
     69         f"Could not find a registered artifact repository for: {artifact_uri}. "
     70         f"Currently registered schemes are: {list(self._registry.keys())}"
     71     )
---> 72 return repository(artifact_uri)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/models_artifact_repo.py:44, in ModelsArtifactRepository.__init__(self, artifact_uri)
     42     self.repo = DatabricksModelsArtifactRepository(artifact_uri)
     43 else:
---> 44     uri = ModelsArtifactRepository.get_underlying_uri(artifact_uri)
     45     self.repo = get_artifact_repository(uri)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/models_artifact_repo.py:79, in ModelsArtifactRepository.get_underlying_uri(uri)
     77 client = MlflowClient(registry_uri=databricks_profile_uri)
     78 (name, version) = get_model_name_and_version(client, uri)
---> 79 download_uri = client.get_model_version_download_uri(name, version)
     80 return add_databricks_profile_info_to_artifact_uri(download_uri, databricks_profile_uri)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tracking/client.py:3007, in MlflowClient.get_model_version_download_uri(self, name, version)
   2963 def get_model_version_download_uri(self, name: str, version: str) -> str:
   2964     """
   2965     Get the download location in Model Registry for this model version.
   2966 
   (...)
   3005         Download URI: runs:/027d7bbe81924c5a82b3e4ce979fcab7/sklearn-model
   3006     """
-> 3007     return self._get_registry_client().get_model_version_download_uri(name, version)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tracking/_model_registry/client.py:289, in ModelRegistryClient.get_model_version_download_uri(self, name, version)
    281 def get_model_version_download_uri(self, name, version):
    282     """
    283     Get the download location in Model Registry for this model version.
    284 
   (...)
    287     :return: A single URI location that allows reads for downloading.
    288     """
--> 289     return self.store.get_model_version_download_uri(name, version)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:716, in FileStore.get_model_version_download_uri(self, name, version)
    706 def get_model_version_download_uri(self, name, version):
    707     """
    708     Get the download location in Model Registry for this model version.
    709     NOTE: For first version of Model Registry, since the models are not copied over to another
   (...)
    714     :return: A single URI location that allows reads for downloading.
    715     """
--> 716     model_version = self.get_model_version(name, version)
    717     return model_version.source

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:704, in FileStore.get_model_version(self, name, version)
    702 _validate_model_name(name)
    703 _validate_model_version(version)
--> 704 return self._fetch_model_version_if_exists(name, version)

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:680, in FileStore._fetch_model_version_if_exists(self, name, version)
    679 def _fetch_model_version_if_exists(self, name, version):
--> 680     registered_model_version_dir = self._get_model_version_dir(name, version)
    681     if not exists(registered_model_version_dir):
    682         raise MlflowException(
    683             f"Model Version (name={name}, version={version}) not found",
    684             RESOURCE_DOES_NOT_EXIST,
    685         )

File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:492, in FileStore._get_model_version_dir(self, name, version)
    490 registered_model_path = self._get_registered_model_path(name)
    491 if not exists(registered_model_path):
--> 492     raise MlflowException(
    493         f"Registered Model with name={name} not found",
    494         RESOURCE_DOES_NOT_EXIST,
    495     )
    496 return join(registered_model_path, f"version-{version}")

MlflowException: Registered Model with name=model_1 not found
beckyvdh commented 9 months ago

@singankit and @hugobettmach, has there been any progress on this issue? I've encountered the same problem: I can load a model outside of a pipeline component but cannot within one. In both cases I'm using the same conda environment (created from a highly prescriptive yml file after using conda export). In both cases I can navigate to the model file and see it, and even check that it has the same byte sum. But only in the pipeline component does h2o.load_model claim that the model file is not present.

Code


import os
import re
import hashlib

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from mlflow.store.artifact.models_artifact_repo import ModelsArtifactRepository
import h2o
h2o.init()

from utilities import constants

ml_client = MLClient(DefaultAzureCredential(), constants.si, constants.rg, constants.wk)

#get model
model_uri = f"models:/registered_model_dummy_h2o/latest"
print(model_uri)

get_path_path = ModelsArtifactRepository(model_uri).download_artifacts(artifact_path="model.h2o/h2o.yaml")
with open(get_path_path, "r") as file:
    txt = file.read()
r = re.search("model_file: (.*)",txt)
download_name = r[0].replace('model_file: ','')
download_path = ModelsArtifactRepository(model_uri).download_artifacts(artifact_path="model.h2o")

print("\n\nAttempting to load the following model object:")
print(f"\n{download_path}")
print(f"{download_name}")

print("\n\nChange directories and list contents to see if it is getting properly downloaded to a temporary location:")
os.chdir(download_path)
print(os.getcwd())
print(os.listdir())

print("\n\nThe model file is present, and has an identical byte sum in and out of the pipeline job:")
print(f"\n{hashlib.md5(open(download_path + '/' + download_name,'rb').read()).hexdigest()}")

print("\n\nWe have an identical version of h2o")
print(h2o.__version__)

print("\n\nBut h2o.load_model() fails only in the pipeline job")
model = h2o.load_model(download_path + '/' + download_name)
print("\nSuccessfully loaded model.")`

Output in pipeline job:

models:/registered_model_dummy_h2o/latest

Attempting to load the following model object:
/tmp/tmpl8zew7dw/dummy_model/model.h2o
DRF_model_python_1701098611982_3

Change directories and list contents to see if it is getting properly downloaded to a temporary location:
/tmp/tmpl8zew7dw/dummy_model/model.h2o
['h2o.yaml', 'DRF_model_python_1701098611982_3']

The model file is present, and has an identical byte sum in and out of the pipeline job:
**438ff14d7d17b1d11fb39d16ce000730**

We have an identical version of h2o
3.42.0.3

But h2o.load_model() fails only in the pipeline job
Traceback (most recent call last):
  File "dummy_script.py", line 51, in <module>
    model = h2o.load_model(download_path + '/' + download_name)
  File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/h2o.py", line 1579, in load_model
    res = api("POST /99/Models.bin/%s" % "", data={"dir": path})
  File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/h2o.py", line 122, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/backend/connection.py", line 499, in request
    return self._process_response(resp, save_to)
  File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/backend/connection.py", line 853, in _process_response
    raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
  Error: Illegal argument: dir of function: importModel: water.api.FSIOException: FS IO Failure: 
 accessed path : file:/tmp/tmpl8zew7dw/dummy_model/model.h2o/DRF_model_python_1701098611982_3 msg: File not found
  Request: POST /99/Models.bin/
    data: {'dir': '/tmp/tmpl8zew7dw/dummy_model/model.h2o/DRF_model_python_1701098611982_3'}

Output outside of pipeline job

models:/registered_model_dummy_h2o/latest

Attempting to load the following model object:

/tmp/tmpdbd98s52/dummy_model/model.h2o
DRF_model_python_1701098611982_3

Change directories and list contents to see if it is getting properly downloaded to a temporary location:
/tmp/tmpdbd98s52/dummy_model/model.h2o
['h2o.yaml', 'DRF_model_python_1701098611982_3']

The model file is present, and has an identical byte sum in and out of the pipeline job:
**438ff14d7d17b1d11fb39d16ce000730**

We have an identical version of h2o
3.42.0.3

But h2o.load_model() fails only in the pipeline job
Successfully loaded model.
hugobettmach commented 9 months ago

Hi @beckyvdh, no, not from my side. I eventually worked around the problem and downloaded the model using the ml_client itself. What you are experiencing is different, though: I never had problems loading models registered in the Azure ML workspace; my issue is with models registered in an Azure ML registry.

I worked around it with something like this:

ml_client_registry.models.download(name=model_name, version=version, download_path=".")

Then I can load it from the local path.
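
A rough sketch of that workaround, assuming the model is a TensorFlow MLflow model and an assumed local folder layout (adjust the path to wherever the MLmodel file actually lands after download):

import mlflow

model_name = "test"
model_version = "1"

# Download the registered model from the AML registry to the local filesystem...
ml_client_registry.models.download(name=model_name, version=model_version, download_path=".")

# ...then load it from the downloaded path instead of a models:/ URI.
# The folder layout below is an assumption; point this at the directory containing MLmodel.
model = mlflow.tensorflow.load_model(f"./{model_name}/mlflow_model_folder")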

beckyvdh commented 9 months ago

Thanks @hugobettmach, that makes sense. I also found a workaround by forcing h2o to close any existing JVMs (and therefore start a fresh JVM) each time I ran the pipeline. I do not know why this is required inside a pipeline component but not when just running the .py file directly, but at least it works.
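
A rough sketch of that JVM-restart workaround (the exact h2o calls here are an assumption, not taken from the thread):

import h2o

# Shut down any JVM left over from a previous connection so the next init starts fresh.
try:
    h2o.connect()             # attach to an already-running local cluster, if any
    h2o.cluster().shutdown()  # and stop it
except Exception:
    pass                      # nothing was running

h2o.init()  # start a fresh JVM for this run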