Open hugobettmach opened 1 year ago
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.
Thanks @hugobettmach. This should work if you set the registry URI with mlflow.set_registry_uri instead of setting the tracking URI.
Why this works: once you set the registry URI, MLflow knows to look for the model in the AML Registry and not in the AML Workspace.
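The suggestion above can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: `build_model_uri` and `load_registry_model` are hypothetical helper names, the registry URI is passed in as an opaque string because its exact format is not given in this thread, and `mlflow.pyfunc.load_model` is used here only as a flavor-agnostic loader.

```python
def build_model_uri(name: str, version: str) -> str:
    """Return an MLflow model URI of the form models:/<name>/<version>."""
    if not name or not str(version):
        raise ValueError("name and version must be non-empty")
    return f"models:/{name}/{version}"


def load_registry_model(registry_uri: str, name: str, version: str):
    """Point MLflow at a registry, then load a model from it.

    Assumption: registry_uri is the MLflow URI of the Azure ML registry,
    obtained elsewhere; this sketch does not construct it.
    """
    import mlflow  # imported lazily so build_model_uri works without mlflow

    # Resolve models:/ URIs against the registry instead of the
    # workspace tracking server.
    mlflow.set_registry_uri(registry_uri)
    return mlflow.pyfunc.load_model(build_model_uri(name, version))
```

Keeping the URI construction separate from the MLflow calls makes it easy to print and inspect the exact `models:/` URI being requested before any network call happens.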
Thanks. This does avoid the other error, but now it seems it can't find the model:
---------------------------------------------------------------------------
MlflowException Traceback (most recent call last)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tensorflow/__init__.py:603, in load_model(model_uri, dst_path, saved_model_kwargs, keras_model_kwargs)
555 """
556 Load an MLflow model that contains the TensorFlow flavor from the specified path.
557
(...)
599 ]
600 """
601 import tensorflow
--> 603 local_model_path = _download_artifact_from_uri(artifact_uri=model_uri, output_path=dst_path)
605 model_configuration_path = os.path.join(local_model_path, MLMODEL_FILE_NAME)
606 model_conf = Model.load(model_configuration_path)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tracking/artifact_utils.py:100, in _download_artifact_from_uri(artifact_uri, output_path)
94 """
95 :param artifact_uri: The *absolute* URI of the artifact to download.
96 :param output_path: The local filesystem path to which to download the artifact. If unspecified,
97 a local output path will be created.
98 """
99 root_uri, artifact_path = _get_root_uri_and_artifact_path(artifact_uri)
--> 100 return get_artifact_repository(artifact_uri=root_uri).download_artifacts(
101 artifact_path=artifact_path, dst_path=output_path
102 )
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/artifact_repository_registry.py:115, in get_artifact_repository(artifact_uri)
105 def get_artifact_repository(artifact_uri):
106 """Get an artifact repository from the registry based on the scheme of artifact_uri
107
108 :param artifact_uri: The artifact store URI. This URI is used to select which artifact
(...)
113 requirements.
114 """
--> 115 return _artifact_repository_registry.get_artifact_repository(artifact_uri)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/artifact_repository_registry.py:72, in ArtifactRepositoryRegistry.get_artifact_repository(self, artifact_uri)
67 if repository is None:
68 raise MlflowException(
69 f"Could not find a registered artifact repository for: {artifact_uri}. "
70 f"Currently registered schemes are: {list(self._registry.keys())}"
71 )
---> 72 return repository(artifact_uri)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/models_artifact_repo.py:44, in ModelsArtifactRepository.__init__(self, artifact_uri)
42 self.repo = DatabricksModelsArtifactRepository(artifact_uri)
43 else:
---> 44 uri = ModelsArtifactRepository.get_underlying_uri(artifact_uri)
45 self.repo = get_artifact_repository(uri)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/artifact/models_artifact_repo.py:79, in ModelsArtifactRepository.get_underlying_uri(uri)
77 client = MlflowClient(registry_uri=databricks_profile_uri)
78 (name, version) = get_model_name_and_version(client, uri)
---> 79 download_uri = client.get_model_version_download_uri(name, version)
80 return add_databricks_profile_info_to_artifact_uri(download_uri, databricks_profile_uri)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tracking/client.py:3007, in MlflowClient.get_model_version_download_uri(self, name, version)
2963 def get_model_version_download_uri(self, name: str, version: str) -> str:
2964 """
2965 Get the download location in Model Registry for this model version.
2966
(...)
3005 Download URI: runs:/027d7bbe81924c5a82b3e4ce979fcab7/sklearn-model
3006 """
-> 3007 return self._get_registry_client().get_model_version_download_uri(name, version)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/tracking/_model_registry/client.py:289, in ModelRegistryClient.get_model_version_download_uri(self, name, version)
281 def get_model_version_download_uri(self, name, version):
282 """
283 Get the download location in Model Registry for this model version.
284
(...)
287 :return: A single URI location that allows reads for downloading.
288 """
--> 289 return self.store.get_model_version_download_uri(name, version)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:716, in FileStore.get_model_version_download_uri(self, name, version)
706 def get_model_version_download_uri(self, name, version):
707 """
708 Get the download location in Model Registry for this model version.
709 NOTE: For first version of Model Registry, since the models are not copied over to another
(...)
714 :return: A single URI location that allows reads for downloading.
715 """
--> 716 model_version = self.get_model_version(name, version)
717 return model_version.source
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:704, in FileStore.get_model_version(self, name, version)
702 _validate_model_name(name)
703 _validate_model_version(version)
--> 704 return self._fetch_model_version_if_exists(name, version)
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:680, in FileStore._fetch_model_version_if_exists(self, name, version)
679 def _fetch_model_version_if_exists(self, name, version):
--> 680 registered_model_version_dir = self._get_model_version_dir(name, version)
681 if not exists(registered_model_version_dir):
682 raise MlflowException(
683 f"Model Version (name={name}, version={version}) not found",
684 RESOURCE_DOES_NOT_EXIST,
685 )
File ~/Projects/aion/.venv/lib/python3.10/site-packages/mlflow/store/model_registry/file_store.py:492, in FileStore._get_model_version_dir(self, name, version)
490 registered_model_path = self._get_registered_model_path(name)
491 if not exists(registered_model_path):
--> 492 raise MlflowException(
493 f"Registered Model with name={name} not found",
494 RESOURCE_DOES_NOT_EXIST,
495 )
496 return join(registered_model_path, f"version-{version}")
MlflowException: Registered Model with name=model_1 not found
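The last frames of this traceback are in `mlflow/store/model_registry/file_store.py`, which suggests MLflow resolved the model registry to a local file store rather than to the Azure ML registry. A small check like the one below may help confirm which URIs are in effect before loading. It is a sketch under assumptions: `looks_like_local_store` and `report_mlflow_uris` are hypothetical helpers, and the rule used (no scheme or a `file` scheme means local disk) is a simplification of MLflow's own resolution logic.

```python
from urllib.parse import urlparse


def looks_like_local_store(uri: str) -> bool:
    """Heuristic: a URI with no scheme or a file scheme points at local disk."""
    scheme = urlparse(uri).scheme
    return scheme in ("", "file")


def report_mlflow_uris() -> None:
    """Print the tracking and registry URIs MLflow is currently using."""
    import mlflow  # lazy import; only needed for the live check

    for label, uri in (
        ("tracking", mlflow.get_tracking_uri()),
        ("registry", mlflow.get_registry_uri()),
    ):
        kind = (
            "local file store"
            if looks_like_local_store(uri)
            else "not a local file store"
        )
        print(f"{label} URI: {uri} ({kind})")
```

Calling `report_mlflow_uris()` right before the failing `load_model` call, both locally and inside the component, would show whether the two environments are pointing at the same registry.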
@singankit and @hugobettmach, has there been any progress on this issue? I've encountered the same problem where I can load a model outside of a pipeline component but cannot within a pipeline component. In both cases I'm using the same conda environment (created from a highly prescriptive yml file after using conda export).
In both cases I can navigate to and see the model file, and even check that it has the same byte sum. But only in the pipeline component does h2o.load_model claim the model file is not present.
Code
import os
import re
import hashlib
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from mlflow.store.artifact.models_artifact_repo import ModelsArtifactRepository
import h2o
h2o.init()
from utilities import constants
ml_client = MLClient(DefaultAzureCredential(), constants.si, constants.rg, constants.wk)
#get model
model_uri = f"models:/registered_model_dummy_h2o/latest"
print(model_uri)
get_path_path = ModelsArtifactRepository(model_uri).download_artifacts(artifact_path="model.h2o/h2o.yaml")
with open(get_path_path, "r") as file:
    txt = file.read()
r = re.search("model_file: (.*)",txt)
download_name = r[0].replace('model_file: ','')
download_path = ModelsArtifactRepository(model_uri).download_artifacts(artifact_path="model.h2o")
print("\n\nAttempting to load the following model object:")
print(f"\n{download_path}")
print(f"{download_name}")
print("\n\nChange directories and list contents to see if it is getting properly downloaded to a temporary location:")
os.chdir(download_path)
print(os.getcwd())
print(os.listdir())
print("\n\nThe model file is present, and has an identical byte sum in and out of the pipeline job:")
print(f"\n{hashlib.md5(open(download_path + '/' + download_name,'rb').read()).hexdigest()}")
print("\n\nWe have an identical version of h2o")
print(h2o.__version__)
print("\n\nBut h2o.load_model() fails only in the pipeline job")
model = h2o.load_model(download_path + '/' + download_name)
print("\nSuccessfully loaded model.")
Output in pipeline job:
models:/registered_model_dummy_h2o/latest
Attempting to load the following model object:
/tmp/tmpl8zew7dw/dummy_model/model.h2o
DRF_model_python_1701098611982_3
Change directories and list contents to see if it is getting properly downloaded to a temporary location:
/tmp/tmpl8zew7dw/dummy_model/model.h2o
['h2o.yaml', 'DRF_model_python_1701098611982_3']
The model file is present, and has an identical byte sum in and out of the pipeline job:
**438ff14d7d17b1d11fb39d16ce000730**
We have an identical version of h2o
3.42.0.3
But h2o.load_model() fails only in the pipeline job
Traceback (most recent call last):
File "dummy_script.py", line 51, in <module>
model = h2o.load_model(download_path + '/' + download_name)
File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/h2o.py", line 1579, in load_model
res = api("POST /99/Models.bin/%s" % "", data={"dir": path})
File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/h2o.py", line 122, in api
return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/backend/connection.py", line 499, in request
return self._process_response(resp, save_to)
File "/azureml-envs/azureml_92857f29549237e5f99fedcba7b31008/lib/python3.8/site-packages/h2o/backend/connection.py", line 853, in _process_response
raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
Error: Illegal argument: dir of function: importModel: water.api.FSIOException: FS IO Failure:
accessed path : file:/tmp/tmpl8zew7dw/dummy_model/model.h2o/DRF_model_python_1701098611982_3 msg: File not found
Request: POST /99/Models.bin/
data: {'dir': '/tmp/tmpl8zew7dw/dummy_model/model.h2o/DRF_model_python_1701098611982_3'}
Output outside of pipeline job
models:/registered_model_dummy_h2o/latest
Attempting to load the following model object:
/tmp/tmpdbd98s52/dummy_model/model.h2o
DRF_model_python_1701098611982_3
Change directories and list contents to see if it is getting properly downloaded to a temporary location:
/tmp/tmpdbd98s52/dummy_model/model.h2o
['h2o.yaml', 'DRF_model_python_1701098611982_3']
The model file is present, and has an identical byte sum in and out of the pipeline job:
**438ff14d7d17b1d11fb39d16ce000730**
We have an identical version of h2o
3.42.0.3
But h2o.load_model() fails only in the pipeline job
Successfully loaded model.
Hi @beckyvdh, no, not from my side. I eventually worked around the problem by downloading the models using the ml_client itself. But what you are experiencing is different: I didn't have problems loading models that were registered in the Azure ML workspace. My issue is with models registered in an Azure ML registry.
Worked around with something like this:
ml_client_registry.models.download(name=model_name, version=version, download_path=".")
Then I can load it from the local path.
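A slightly fuller version of this workaround might look like the sketch below. It assumes the client is an `azure.ai.ml.MLClient` bound to the registry; `download_registry_model` and `make_registry_client` are hypothetical helper names, and the layout of the files under the target directory is not specified here, so inspect it before loading.

```python
from pathlib import Path


def download_registry_model(client, name: str, version: str, target_dir: str = ".") -> Path:
    """Download a registered model's files via an MLClient-like object.

    Assumption: client exposes models.download(name=..., version=...,
    download_path=...), as in the one-liner above.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    client.models.download(name=name, version=version, download_path=str(target))
    return target


def make_registry_client(registry_name: str):
    """Build an MLClient bound to a registry (needs azure-ai-ml and azure-identity)."""
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(credential=DefaultAzureCredential(), registry_name=registry_name)
```

Passing the client in as an argument keeps the download step independent of how authentication is set up, which differs between a local session and a pipeline component.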
Thanks @hugobettmach, makes sense. I also found a workaround by forcing h2o to close any existing JVMs (and therefore start a fresh JVM) each time I ran the pipeline. I do not know why this was required for a pipeline component, but not when just running the .py file, but at least it works.
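The exact code for this workaround is not shown in the thread. One way it might look, as a hedged sketch: attach to whatever H2O server is reachable, shut it down, then start a new one. `restart_h2o` and its `wait_seconds` parameter are hypothetical; `h2o.init()` and `h2o.cluster().shutdown()` are the only H2O calls used, and the pause before re-initialising is a guess that may need tuning.

```python
import time


def restart_h2o(h2o_module=None, wait_seconds: float = 2.0):
    """Stop the H2O server this process can reach, then start a fresh one."""
    if h2o_module is None:
        import h2o as h2o_module  # lazy import so a stub can be passed in tests

    # Attach to an already-running local server, or start one if none exists.
    h2o_module.init()

    # Shut down whichever server we are now attached to.
    cluster = h2o_module.cluster()
    if cluster is not None:
        cluster.shutdown()

    # Assumption: give the old JVM a moment to exit before starting a new one.
    time.sleep(wait_seconds)

    h2o_module.init()
    return h2o_module
```

Calling `restart_h2o()` at the top of the component script, in place of a bare `h2o.init()`, would ensure the JVM that later handles `h2o.load_model` was started by the same process that downloaded the model files.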
Describe the bug
I'm trying to load a model from an Azure ML registry using mlflow.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I would expect this to work both locally and in the component. Locally I get the model object as expected.
Additional context
The client authenticates correctly inside the component, as I have granted the user-assigned managed identity attached to the cluster access to the Azure ML registry. I can run operations using MLClient to register or read assets from the registry.
I seem to have found what the problem is by looking at the code inside azureml-mlflow:
azureml.mlflow._internal.utils line 608: there is a call to this method, and registry_name is passed as an arg.
azureml.mlflow._common._authentication.azureml_token_authentication line 169: registry_name is not in the list of args and I don't see logic in the code to work with it: