AlexandreOuellet opened 6 days ago
Here's my current workaround. First, I create a custom dataset:
```python
from typing import Any, Dict, Literal, Optional, Type, Union

from kedro.io.core import (
    VERSION_KEY,
    VERSIONED_FLAG_KEY,
    AbstractDataset,
    Version,
)
from kedro_azureml.config import AzureMLConfig
from kedro_azureml.datasets.asset_dataset import AzureMLAssetDataset

AzureMLDataAssetType = Literal["uri_file", "uri_folder"]


class AzureMLDataset(AzureMLAssetDataset):
    def __init__(
        self,
        azureml_dataset: str,
        dataset: Union[str, Type[AbstractDataset], Dict[str, Any]],
        credentials: Optional[Dict[str, Any]] = None,  # accept the resolved credentials
        root_dir: str = "data",
        filepath_arg: str = "filepath",
        azureml_type: AzureMLDataAssetType = "uri_folder",
        version: Optional[Version] = None,
        metadata: Optional[Dict[str, Any]] = None,
    ):
        super().__init__(
            azureml_dataset,
            dataset,
            root_dir,
            filepath_arg,
            azureml_type,
            version,
            metadata,
        )
        # Configure azureml_config from the credentials (needed for dataset factories)
        self._azureml_config = AzureMLConfig(**credentials) if credentials else None
```
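The subclass exists only to accept one extra constructor argument (`credentials`) that the parent class does not know about; everything else is forwarded unchanged, and the extra argument is turned into config state only when it is actually provided. The shape of the trick, reduced to plain Python with stand-in classes (not the real kedro types):

```python
from typing import Any, Dict, Optional


class Parent:
    # Stand-in for AzureMLAssetDataset: knows nothing about credentials.
    def __init__(self, name: str):
        self.name = name


class Child(Parent):
    # Stand-in for the custom AzureMLDataset subclass.
    def __init__(self, name: str, credentials: Optional[Dict[str, Any]] = None):
        super().__init__(name)  # parent signature left untouched
        # Same guard as the workaround: no credentials means no config,
        # never a half-initialised config object.
        self.config = dict(credentials) if credentials else None


print(Child("ds", credentials={"token": "t"}).config)  # {'token': 't'}
print(Child("ds").config)  # None
```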
I also have a custom hook:
```python
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro_azureml.config import AzureMLConfig
from kedro_azureml.runner import AzurePipelinesRunner

from utils.azure_ml_dataset import AzureMLDataset


class AMLRunHook:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Inject the azureml credentials into the context under a "credentials" key
        if "azureml" not in context.config_loader.config_patterns.keys():
            context.config_loader.config_patterns.update(
                {"azureml": ["azureml*", "azureml*/**", "**/azureml*"]}
            )
        self.azure_config = AzureMLConfig(**context.config_loader["azureml"]["azure"])
        azure_creds = {"azureml": self.azure_config.__dict__}
        context.config_loader["credentials"] = {
            **context.config_loader["credentials"],
            **azure_creds,
        }

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline: Pipeline, catalog: DataCatalog):
        """Hook implementation to change the dataset path for local runs.

        Modified to also handle datasets resolved from dataset factories.

        Args:
            run_params: The parameters that are passed to the run command.
            pipeline: The ``Pipeline`` object representing the pipeline to be run.
            catalog: The ``DataCatalog`` from which to fetch data.
        """
        for dataset_name in pipeline.all_inputs():
            if dataset_name in catalog:
                dataset = catalog._get_dataset(dataset_name)
                if isinstance(dataset, AzureMLDataset):
                    if AzurePipelinesRunner.__name__ not in run_params["runner"]:
                        # When running locally and the dataset is only an
                        # intermediate one, we don't want to download it, but we
                        # still want to run locally with a local version.
                        if dataset_name not in pipeline.inputs():
                            dataset.as_local_intermediate()
                    else:
                        # When running remotely, we still want to provide the
                        # azureml config so the dataset version can be resolved
                        # during the remote run.
                        dataset.as_remote()
                    catalog.add(dataset_name, dataset, replace=True)


aml_run_hook = AMLRunHook()
```
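The credentials update in `after_context_created` is a plain shallow dict merge: every credential set already loaded from `credentials.yml` is preserved, and an `azureml` entry is added (or overwritten if one already existed). A minimal illustration with made-up credential names and values:

```python
# Existing credentials, as they might be loaded from credentials.yml (made-up names).
existing = {"my_blob_store": {"account_name": "acc", "account_key": "key"}}

# What the hook derives from the azureml config (made-up values).
azure_creds = {
    "azureml": {"subscription_id": "sub", "resource_group": "rg", "workspace_name": "ws"}
}

# Same expression as in the hook: later keys win, everything else is preserved.
merged = {**existing, **azure_creds}
print(sorted(merged))  # ['azureml', 'my_blob_store']
```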
After that, I make sure the hook is registered in `settings.py`:

```python
from my_pipeline.hooks import aml_run_hook

HOOKS = (aml_run_hook,)
```

In the catalog, usage looks as follows:
```yaml
"{name}_csv":
  type: utils.azure_ml_dataset.AzureMLDataset
  azureml_dataset: imcd_raw
  root_dir: data/01_raw/
  credentials: azureml
  dataset:
    type: pandas.CSVDataset
    filepath: "{name}.csv"
```
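For context on how this factory entry resolves (a simplified sketch, not kedro's actual parser, which is built on the `parse` library): a pattern like `{name}_csv` matches any dataset name ending in `_csv`, and the captured `name` is substituted into the nested `filepath` template.

```python
import re
from typing import Dict, Optional


def resolve(pattern: str, dataset_name: str) -> Optional[Dict[str, str]]:
    # Turn "{name}_csv" into a regex with a named group (simplified).
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern)
    m = re.fullmatch(regex, dataset_name)
    return m.groupdict() if m else None


placeholders = resolve("{name}_csv", "companies_csv")
print(placeholders)  # {'name': 'companies'}
# The captured value fills the template inside the entry body:
print("{name}.csv".format(**placeholders))  # companies.csv
```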
After that, you should be able to do your regular `kedro azureml run` and get the expected results in Azure ML. The PR will take a while longer, as I have not yet looked into properly adding unit tests (everything else is green, though), but it works correctly in my use case, and at least there is a workaround for those wishing to use pipeline datasets.
When used in a dataset factory, the azureml credentials are not set up correctly, and the run crashes instead. For instance, the following crashes with missing credentials:
The issue is that when the catalog is created, dataset factory entries are not actual datasets but dataset patterns, which are only resolved afterward. The `after_catalog_created` hook is then called, but at that point there is still no actual dataset behind the pattern.
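To illustrate the timing problem (a toy model, not kedro's implementation): the catalog stores patterns, and a concrete dataset only comes into existence when a matching name is first requested, which happens after `after_catalog_created` has already fired.

```python
class ToyCatalog:
    def __init__(self):
        self._datasets = {}  # concrete datasets, known up front
        self._patterns = {}  # factory patterns, resolved on demand

    def add_pattern(self, pattern, factory):
        self._patterns[pattern] = factory

    def get(self, name):
        if name not in self._datasets:
            # Resolve lazily: only now does the dataset object exist.
            for pattern, factory in self._patterns.items():
                prefix, suffix = pattern.split("{name}")
                if name.startswith(prefix) and name.endswith(suffix):
                    self._datasets[name] = factory(name)
                    break
        return self._datasets.get(name)


catalog = ToyCatalog()
catalog.add_pattern("{name}_csv", lambda name: f"dataset for {name}")

# At "after_catalog_created" time, nothing concrete exists yet:
print(catalog._datasets)  # {}
# Only the first access materialises the dataset:
print(catalog.get("sales_csv"))  # dataset for sales_csv
```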
I've discussed this with some people on Kedro's Slack channels, and they seem to agree that the best way to pass the credentials is via an `after_context_created` hook.
I'll open a pull request soon to fix this, using the pattern shown above instead.