getindata / kedro-azureml

Kedro plugin to support running workflows on Microsoft Azure ML Pipelines
https://kedro-azureml.readthedocs.io
Apache License 2.0

Bug using AzureMLAssetDataset locally #147

Open robertmcleod2 opened 1 month ago

robertmcleod2 commented 1 month ago

When using the AzureMLAssetDataset it all works fine when deployed. However, I get an error locally when one pipeline outputs an AzureMLAssetDataset, and another pipeline tries to consume this asset. Here is a reproducible example:

The first pipeline:

from kedro.pipeline import Pipeline, node
import pandas as pd

def create_dataset():
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    return df

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        nodes=[
            node(
                func=create_dataset,
                inputs=None,
                outputs="test_raw",
                name="create_test_raw",
            ),
        ],
    )

The second pipeline:

from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        nodes=[
            node(
                func=lambda x: x,
                inputs="test_raw",
                outputs="test_raw_copy",
                name="copy_test_raw"
            )
        ],
    )

and the catalog:

test_raw:
  type: kedro_azureml.datasets.AzureMLAssetDataset
  azureml_dataset: test_raw
  root_dir: data/00_azurelocals
  versioned: true
  dataset:
    type: pandas.CSVDataset
    filepath: test_raw.csv

test_raw_copy:
  type: kedro_azureml.datasets.AzureMLAssetDataset
  azureml_dataset: test_raw_copy
  root_dir: data/00_azurelocals
  versioned: true
  dataset:
    type: pandas.CSVDataset
    filepath: test_raw_copy.csv

When running the first pipeline locally with kedro run --pipeline test, it creates a local file at data/00_azurelocals/test_raw/local/test_raw.csv. Then when running the second pipeline with kedro run --pipeline copy_test, I get the following stack trace:

(enerfore-deployment) C:\Users\Robert.McLeod2\git_repos\ptx-ds-enerfore-deployment>kedro run --pipeline copy_test
[07/17/24 16:34:09] INFO     Kedro project ptx-ds-enerfore-deployment    session.py:365
[07/17/24 16:34:18] WARNING  Replacing dataset 'test_raw'    data_catalog.py:606
                    WARNING  Replacing dataset 'test_raw_copy'    data_catalog.py:606
                    INFO     Loading data from 'test_raw' (AzureMLAssetDataset)...    data_catalog.py:502
Found the config file in: C:\Users\ROBERT~1.MCL\AppData\Local\Temp\tmpxxwei3q5\config.json
Found the config file in: C:\Users\ROBERT~1.MCL\AppData\Local\Temp\tmp5omkfo_i\config.json
[07/17/24 16:34:31] WARNING  No nodes ran. Repeat the previous command to attempt a new run.    runner.py:213
Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_utils\_asset_utils.py", line 775, in _get_latest_version_from_container
    else container_operation.get(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\core\tracing\decorator.py", line 94, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_restclient\v2023_04_01_preview\operations\_data_containers_operations.py", line 430, in get
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\core\exceptions.py", line 161, in map_error
    raise error
azure.core.exceptions.ResourceNotFoundError: (UserError) test_raw container was not found.
Code: UserError
Message: test_raw container was not found.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 265, in get
    return _resolve_label_to_asset(self, name, label)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_utils\_asset_utils.py", line 1022, in _resolve_label_to_asset
    return resolver(name)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 675, in _get_latest_version
    latest_version = _get_latest_version_from_container(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_utils\_asset_utils.py", line 795, in _get_latest_version_from_container
    raise ValidationException(
azure.ai.ml.exceptions.ValidationException: Asset test_raw does not exist in workspace azuremlworkspace.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 193, in load
    return self._load()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 188, in _load
    azureml_ds = self._get_azureml_dataset()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 182, in _get_azureml_dataset
    self._azureml_dataset, version=self.resolve_load_version()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 576, in resolve_load_version
    return self._fetch_latest_load_version()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\cachetools\__init__.py", line 799, in wrapper
    v = method(self, *args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 175, in _fetch_latest_load_version
    return self._get_latest_version()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 169, in _get_latest_version
    return ml_client.data.get(self._azureml_dataset, label="latest").version
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_telemetry\activity.py", line 292, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 279, in get
    log_and_raise_error(ex)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_exception_helper.py", line 337, in log_and_raise_error
    raise MlException(message=formatted_error, no_personal_data_message=formatted_error)
azure.ai.ml.exceptions.MlException:

1) Resource was not found.

Details:

(x) Asset test_raw does not exist in workspace azuremlworkspace.

Resolutions:
1) Double-check that the resource has been specified correctly and that you have access to it.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

Additional Resources: The easiest way to author a yaml specification file is using IntelliSense and auto-completion Azure ML VS code extension provides: https://code.visualstudio.com/docs/datascience/azure-machine-learning. To set up VS Code, visit https://docs.microsoft.com/azure/machine-learning/how-to-setup-vs-code

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\cli\cli.py", line 211, in main
    cli_collection()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\cli\cli.py", line 139, in main
    super().main(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\cli\project.py", line 453, in run
    session.run(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\session\session.py", line 436, in run
    run_result = runner.run(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\runner.py", line 103, in run
    self._run(pipeline, catalog, hook_manager, session_id)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\sequential_runner.py", line 70, in _run
    run_node(node, catalog, hook_manager, self._is_async, session_id)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\runner.py", line 331, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\runner.py", line 414, in _run_node_sequential
    inputs[name] = catalog.load(name)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\data_catalog.py", line 506, in load
    result = dataset.load()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 614, in load
    return super().load()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set AzureMLAssetDataset(dataset_config={'filepath': test_raw.csv}, dataset_type=CSVDataset, filepath_arg=filepath, root_dir=data/00_azurelocals).

1) Resource was not found.

Details:

(x) Asset test_raw does not exist in workspace azuremlworkspace.

Resolutions:
1) Double-check that the resource has been specified correctly and that you have access to it.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

Additional Resources: The easiest way to author a yaml specification file is using IntelliSense and auto-completion Azure ML VS code extension provides: https://code.visualstudio.com/docs/datascience/azure-machine-learning. To set up VS Code, visit https://docs.microsoft.com/azure/machine-learning/how-to-setup-vs-code

So it seems to be trying to find a version of the dataset on Azure rather than using the local copy. When a version does exist on Azure, it puts that Azure version number in the directory path instead of local, i.e. it looks for a file at data/00_azurelocals/test_raw/4/test_raw.csv.

I'm not sure why it is trying to find the dataset on Azure; I would expect it to just look at the local files instead. This error only happens when using an AzureMLAssetDataset as an input locally. Any help is appreciated, thanks.
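
To make the path mismatch concrete, here is a minimal sketch of the directory layout implied by the two paths reported above. The helper name asset_file_path is hypothetical and the layout (root_dir / asset name / version / filepath) is inferred only from those two paths, not taken from the plugin's source.

```python
from pathlib import PurePosixPath


def asset_file_path(root_dir: str, asset_name: str, version: str, filepath: str) -> str:
    """Build <root_dir>/<asset_name>/<version>/<filepath>.

    Hypothetical helper: the layout is inferred from the paths quoted in
    this issue, not copied from kedro-azureml.
    """
    return str(PurePosixPath(root_dir) / asset_name / version / filepath)


# Path written by the first pipeline when it runs locally.
local_path = asset_file_path("data/00_azurelocals", "test_raw", "local", "test_raw.csv")

# Path looked up by the second pipeline once Azure reports version 4 as the latest.
azure_path = asset_file_path("data/00_azurelocals", "test_raw", "4", "test_raw.csv")

print(local_path)
print(azure_path)
```

The two paths differ only in the version segment, which is why a file saved under local is not found when the load step resolves a version from the workspace.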

robertmcleod2 commented 1 month ago

An alternative for local development would be to define local catalogs under conf/local for local data storage. However, when running kedro azureml run, these local catalogs seem to interfere, and the Azure ML pipelines no longer create data assets. This occurs even when conf/local is excluded by the .dockerignore file, which is odd.

robertmcleod2 commented 1 month ago

After a bit more digging into why conf/local seems to override the values in conf/base when running kedro azureml run, even when conf/local is listed in the .dockerignore, I found this code in the CLI: https://github.com/getindata/kedro-azureml/blob/d5c2011c7ed7fdc03235bf2bd6701f1901d1139c/kedro_azureml/cli.py#L41-L55

It sets the config environment to local by default, and as far as I can tell this can only be changed by setting the KEDRO_ENV environment variable. Indeed, setting KEDRO_ENV to base fixed this issue for me. This seems like a very hard-to-find feature, so to me it would make more sense if the env could be set when running kedro azureml run, and if the env were also picked up from the values in the Kedro settings.py file.
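
A minimal sketch of the behaviour described here, assuming the environment falls back to local whenever KEDRO_ENV is unset. The function name resolve_env is hypothetical, and this is not the plugin's actual code.

```python
import os
from typing import Mapping, Optional


def resolve_env(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the config environment, defaulting to "local".

    Hypothetical helper mirroring the behaviour described in this comment:
    KEDRO_ENV wins if it is set, otherwise "local" is used.
    """
    environ = os.environ if environ is None else environ
    return environ.get("KEDRO_ENV", "local")


# Without KEDRO_ENV, the catalog under conf/local would be picked up.
print(resolve_env({}))

# With KEDRO_ENV=base, only conf/base would be used.
print(resolve_env({"KEDRO_ENV": "base"}))
```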

tomasvanpottelbergh commented 1 week ago

Regarding the original question: AzureMLAssetDataset was mainly designed to easily use Data Assets from your Azure ML workspace as inputs. Creating them from local runs was not implemented because it would create untraceable artifacts in the workspace. Using them in the middle of a pipeline seems to be an edge case that is currently not covered by the implementation. I agree it would make more sense if the dataset is retrieved from the output of the previous node in that case.

Regarding the workaround using conf/local: I'm not sure I understand the problem. The environment can be specified by adding --env {environment} to the kedro azureml run command. Does this not work for you?

robertmcleod2 commented 1 week ago

Hi Tomas, thanks for the response. For the workaround, I do not seem to be able to specify the environment using the --env option with kedro azureml run:

$ kedro azureml run --env base
Usage: kedro azureml run [OPTIONS]
Try 'kedro azureml run -h' for help.

Error: No such option: --env (Possible options: --aml-env, --env-var)

tomasvanpottelbergh commented 1 week ago

I guess this should be clarified in the docs, but the --env argument should go directly after azureml: kedro azureml --env base run.
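
The ordering rule can be illustrated with a small sketch. The Kedro CLI is built on click (it appears in the traceback above), but this example uses argparse from the standard library purely as a stand-in, so it only shows the general pattern of a group-level option, not the plugin's own code. The parser and subcommand here are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the `kedro azureml` command group.
    parser = argparse.ArgumentParser(prog="azureml")
    # Group-level option: it belongs to the group, not to any subcommand.
    parser.add_argument("--env", default="local")
    subcommands = parser.add_subparsers(dest="command", required=True)
    # The `run` subcommand defines no --env option of its own.
    subcommands.add_parser("run")
    return parser


parser = build_parser()

# Option placed before the subcommand: accepted.
args = parser.parse_args(["--env", "base", "run"])
print(args.env, args.command)

# Option placed after the subcommand: rejected, because `run` does not know it.
try:
    parser.parse_args(["run", "--env", "base"])
except SystemExit:
    print("rejected: --env is not an option of the run subcommand")
```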

robertmcleod2 commented 1 week ago

Aha, I see. Yes, I agree this should definitely be clarified, as that is unusual behaviour. Currently the docs don't mention --env as an option at all, apart from a brief mention on the Installation page. It would be helpful to have this detailed on the Quickstart page.

Could I ask why the --env argument is set up in this way? Couldn't it be added as an argument to the kedro azureml run command?

tomasvanpottelbergh commented 1 week ago

You'd have to ask @marrrcin as he implemented this, but I'd guess it was because it's the easiest way of adding the option to every kedro azureml command. Adding it to every command individually, so it would come at the end of the command, would be consistent with how Kedro does it.

marrrcin commented 1 week ago

but I'd guess it was because it's the easiest way of adding the option to every kedro azureml command

👆🏻👆🏻👆🏻 It's most likely that, but TBH it was super long time ago and I don't remember exactly :D This implementation is consistent across Getindata's Kedro plugins, e.g. https://github.com/getindata/kedro-vertexai/blob/5ee3304054dc1f913fb962ed1424d0fb42c7c08c/kedro_vertexai/cli.py#L36

BTW @robertmcleod2, there's an unwritten approach among Kedro users that the base environment should be left out. Kedro always defaults to running on local: https://github.com/kedro-org/kedro/blob/ba981350ad57dbcaabf5fd758a9a3d4399a91f20/kedro/framework/project/__init__.py#L116
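
As a rough illustration of why a catalog in conf/local can shadow entries from conf/base when the run environment defaults to local, here is a simplified sketch of environment layering. The function merge_environments and the local catalog entry (including its filepath) are hypothetical, and Kedro's real config loader does more than this.

```python
def merge_environments(base: dict, run_env: dict) -> dict:
    """Top-level keys from the run environment replace those from base.

    Simplified, hypothetical model of config layering; not Kedro's actual loader.
    """
    return {**base, **run_env}


# Entries as they might appear in conf/base/catalog.yml (taken from this issue).
base_catalog = {
    "test_raw": {"type": "kedro_azureml.datasets.AzureMLAssetDataset"},
    "test_raw_copy": {"type": "kedro_azureml.datasets.AzureMLAssetDataset"},
}

# Hypothetical override in conf/local/catalog.yml for local development.
local_catalog = {
    "test_raw": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/test_raw.csv"},
}

merged = merge_environments(base_catalog, local_catalog)

# test_raw is now a plain CSV dataset; test_raw_copy is still the Azure ML asset.
print(merged["test_raw"]["type"])
print(merged["test_raw_copy"]["type"])
```

Under this model, any dataset redefined in the local environment stops being an AzureMLAssetDataset unless a different environment is selected.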