robertmcleod2 opened 4 months ago
An alternative for local development would be to set local catalogs under `conf/local` for local data storage. However, when running `kedro azureml run`, these local catalogs seem to interfere with it somehow, and the AzureML pipelines no longer create data assets. This occurs even when `conf/local` is ignored by the `.dockerignore` file, which is odd.
After a bit more digging into why running `kedro azureml run` with `conf/local` set seems to interfere with the values in `conf/base`, even when those files are listed in `.dockerignore`, I found this code in the CLI: https://github.com/getindata/kedro-azureml/blob/d5c2011c7ed7fdc03235bf2bd6701f1901d1139c/kedro_azureml/cli.py#L41-L55

It sets the catalog environment to `local`, and this can only be changed by setting the `KEDRO_ENV` environment variable. Indeed, setting the `KEDRO_ENV` environment variable to `base` fixed the issue for me. This seems like a very hard-to-find feature, so to me it would make more sense if the environment could be set when running `kedro azureml run`, and also picked up from the values in the Kedro `settings.py` file.
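For reference, the fallback behaviour described above can be mimicked with a small sketch. This is a hypothetical helper illustrating the pattern, not the plugin's actual code: the environment defaults to `local` unless `KEDRO_ENV` is set.

```python
import os

# Hypothetical sketch of the environment-resolution pattern described above:
# the Kedro environment falls back to "local" unless KEDRO_ENV is set.
def resolve_kedro_env(default: str = "local") -> str:
    return os.environ.get("KEDRO_ENV", default)

# With no KEDRO_ENV set, the local environment (and its catalog) wins:
os.environ.pop("KEDRO_ENV", None)
print(resolve_kedro_env())  # -> local

# Exporting KEDRO_ENV=base makes conf/base take effect instead:
os.environ["KEDRO_ENV"] = "base"
print(resolve_kedro_env())  # -> base
```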
Regarding the original question: `AzureMLAssetDataset` was mainly designed to easily use Data Assets from your Azure ML workspace as inputs. Creating them from local runs was not implemented because it would create untraceable artifacts in the workspace. Using them in the middle of a pipeline seems to be an edge case that is currently not covered by the implementation. I agree it would make more sense if the dataset were retrieved from the output of the previous node in that case.
Regarding the workaround using `conf/local`: I'm not sure I understand the problem. The environment can be specified by adding `--env {environment}` to the `kedro azureml run` command. Does this not work for you?
Hi Tomas, thanks for the response. For the workaround, I do not seem to be able to specify the environment using the `--env` option with `kedro azureml run`:

```
$ kedro azureml run --env base
Usage: kedro azureml run [OPTIONS]
Try 'kedro azureml run -h' for help.

Error: No such option: --env (Possible options: --aml-env, --env-var)
```
I guess this should be clarified in the docs, but the `--env` argument should go directly after `azureml`: `kedro azureml --env base run`.
Aha, I see. Yes, I agree this should definitely be clarified; that is unusual functionality. Currently there isn't even a mention of `--env` as an option in the docs, apart from a small mention on the Installation page. It would be helpful to have this detailed on the Quickstart page.

Could I ask why the `--env` argument is set up in this way? Couldn't it be added as an argument to the `kedro azureml run` command?
You'd have to ask @marrrcin as he implemented this, but I'd guess it was because it's the easiest way of adding the option to every `kedro azureml` command. Adding it to every command individually, so that it would come at the end of the command, would be consistent with how Kedro does it.
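The pattern described here, a single option declared on the command group rather than repeated on each subcommand, can be sketched with plain `click` (the CLI library Kedro plugins are built on). This is an illustrative toy with made-up names, not the plugin's actual CLI:

```python
import click

# Toy sketch of a group-level option, mirroring why `azureml --env base run`
# works while `azureml run --env base` does not. Names are illustrative only.
@click.group()
@click.option("--env", default="local", help="Kedro environment to use.")
@click.pass_context
def azureml(ctx, env):
    # Store the group-level option so every subcommand can read it.
    ctx.obj = {"env": env}

@azureml.command()
@click.pass_context
def run(ctx):
    click.echo(f"running with env={ctx.obj['env']}")

if __name__ == "__main__":
    azureml()
```

Because `--env` is declared on the group, it must appear before the subcommand; placing it after `run` produces a "No such option" error like the one shown earlier in this thread.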
> but I'd guess it was because it's the easiest way of adding the option to every `kedro azureml` command

👆🏻👆🏻👆🏻 It's most likely that, but TBH it was a super long time ago and I don't remember exactly :D This implementation is consistent across Getindata's Kedro plugins, e.g. https://github.com/getindata/kedro-vertexai/blob/5ee3304054dc1f913fb962ed1424d0fb42c7c08c/kedro_vertexai/cli.py#L36
BTW @robertmcleod2, there's an unwritten convention among Kedro users that the `base` environment should be left untouched; Kedro always defaults to running on `local`: https://github.com/kedro-org/kedro/blob/ba981350ad57dbcaabf5fd758a9a3d4399a91f20/kedro/framework/project/__init__.py#L116
When using the `AzureMLAssetDataset`, it all works fine when deployed. However, I get an error locally when one pipeline outputs an `AzureMLAssetDataset` and another pipeline tries to consume this asset. Here is a reproducible example:
The first pipeline:
The second pipeline:
and the catalog:
When running the first pipeline locally with `kedro run --pipeline test`, it creates a local file at `data/00_azurelocals/test_raw/local/test_raw.csv`. Then, when running the second pipeline with `kedro run --pipeline copy_test`, I get the following stack trace:

So it seems like it is trying to find a version of the file on Azure, rather than using the local copy. When a version does exist on Azure, it puts the version number of the dataset on Azure in the directory path rather than `local`, i.e. it will look for a file at `data/00_azurelocals/test_raw/4/test_raw.csv`.

I'm not sure why it is trying to find the dataset on Azure; I would expect the behaviour to be to just look at the local files instead. This error only happens when using an `AzureMLAssetDataset` as an input locally. Any help is appreciated, thanks.
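To make the two observed outcomes concrete, here is a hypothetical path-building helper. This is not kedro-azureml's implementation, just a sketch of the behaviour reported above: with no Azure version the file lands under a `local` subfolder, while an existing Azure version number replaces `local` in the lookup path.

```python
from pathlib import Path
from typing import Optional

# Hypothetical illustration of the two directory layouts described above;
# this is NOT the plugin's code, only a sketch of the observed behaviour.
def asset_path(root: str, name: str, azure_version: Optional[int]) -> str:
    # Observed: if a version exists on Azure, its number is used in the path;
    # otherwise the "local" placeholder folder is used.
    folder = str(azure_version) if azure_version is not None else "local"
    return (Path(root) / name / folder / f"{name}.csv").as_posix()

# Local-only run: the file is written under .../local/...
print(asset_path("data/00_azurelocals", "test_raw", None))
# -> data/00_azurelocals/test_raw/local/test_raw.csv

# When version 4 exists on Azure, the lookup path becomes .../4/...
print(asset_path("data/00_azurelocals", "test_raw", 4))
# -> data/00_azurelocals/test_raw/4/test_raw.csv
```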