getindata / kedro-azureml

Kedro plugin to support running workflows on Microsoft Azure ML Pipelines
https://kedro-azureml.readthedocs.io
Apache License 2.0

AzureMLPipelineDataSet not compatible with pipeline_ml_factory method from kedro-mlflow #53

Open · jpoullet2000 opened this issue 1 year ago

jpoullet2000 commented 1 year ago

The pipeline_ml_factory method in kedro-mlflow is a useful method to store artifacts (transformers, models) automatically (using a kedro-mlflow hook). However, this method calls extract_pipeline_artifacts, which requires the _filepath attribute to be available (see here). The AzureMLPipelineDataSet class does not provide this attribute. Wouldn't it be possible to add it to the class attributes? Do you have any other suggestion for storing the MLflow pipeline?
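
For context, a rough sketch of the requirement described above (hedged: this is not kedro-mlflow's actual implementation, only the shape of it). Artifact extraction resolves each artifact dataset to a local file through its _filepath attribute, which is exactly where AzureMLPipelineDataSet entries come up empty:

    def extract_artifact_paths(catalog, artifact_names):
        """Illustrative only: map each artifact dataset to a local file path."""
        paths = {}
        for name in artifact_names:
            dataset = catalog._get_dataset(name)  # DataCatalog internal accessor
            filepath = getattr(dataset, "_filepath", None)
            if filepath is None:
                # Failure mode for AzureMLPipelineDataSet today: no _filepath.
                raise AttributeError(f"dataset {name!r} exposes no _filepath")
            paths[name] = str(filepath)
        return paths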

marrrcin commented 1 year ago

If adding _filepath helps, then no problem. We're open to PRs :) @Galileo-Galilei any other suggestions?

marrrcin commented 1 year ago

Added _filepath in https://github.com/getindata/kedro-azureml/blob/2e5836b72256d7455d8525c7769a68d4c844ccf7/kedro_azureml/datasets/pipeline_dataset.py#L100 and it's already released in 0.4.0. @jpoullet2000, please let me know if it fixes the problem.
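
For reference, a rough sketch (not the actual plugin code) of what exposing such an attribute on a wrapper dataset can look like; the class name and path handling below are purely illustrative:

    from pathlib import PurePosixPath

    class IllustrativeWrapperDataSet:
        """Stand-in for a dataset that delegates storage to another location."""

        def __init__(self, path: str):
            self.path = path

        @property
        def _filepath(self) -> PurePosixPath:
            # The attribute kedro-mlflow looks up when collecting artifacts.
            return PurePosixPath(self.path)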

FYI @tomasvanpottelbergh

jpoullet2000 commented 1 year ago

Currently out of office. I'll come back to you in 2 weeks.

Galileo-Galilei commented 1 year ago

Sorry for the late reply, I was on holiday too. Just to understand, what is this dataset intended to do?

Actually, kedro-mlflow should only check the filepath for the datasets it needs to use as artifacts for mlflow. So either this is a bug (kedro-mlflow does check the filepath on a dataset it should not) or this solution won't work (kedro-mlflow won't complain, but if there is no data at the given filepath, it will not be able to log it in mlflow nor to fetch it at inference time). What does your pipeline look like? What are you trying to do?

jpoullet2000 commented 1 year ago

Hi. Sorry for the late reply. The goal is to store an MLflow pipeline while running an Azure ML pipeline that wraps a Kedro pipeline. I'd like to use the pipeline_ml_factory method for that. It seems that the issue comes from the fact that kedro-azureml decomposes the Kedro nodes into Azure ML nodes and the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline. That's the reason why I wanted to use the AzureMLPipelineDataSet, which should pass the data from one node to the other. But I'm still not convinced that it solves the issue (still testing).

jpoullet2000 commented 1 year ago

As an illustration, consider a simple pipeline; the Kedro-Viz view is shown in the attached screenshot.

When I try to run the etl_ml_pipeline pipeline, corresponding to the following code:

    from platform import python_version  # imports added for completeness
    from kedro_mlflow.pipeline import pipeline_ml_factory

    etl_ml_pipeline = create_etl_ml_pipeline()
    inference_pipeline_etl_ml = etl_ml_pipeline.only_nodes_with_tags("inference")
    training_pipeline_etl_ml = pipeline_ml_factory(
        training=etl_ml_pipeline.only_nodes_with_tags("training"),
        inference=inference_pipeline_etl_ml,
        input_name="X_test",
        log_model_kwargs=dict(
            artifact_path="poc_kedro_azureml_mlflow",
            # conda_env="src/requirements.txt",
            conda_env={
                "python": python_version(),
                "build_dependencies": ["pip"],
                "dependencies": [
                    f"poc_kedro_azureml_mlflow=={PROJECT_VERSION}",
                    {"pip": dependencies},
                ],
            },
            signature="auto",
        ),
    )

I get the following error:

    KedroMlflowPipelineMLError: The following inputs are free for the inference
    pipeline:
        - scaler
        - rf_model.
    No free input is allowed. Please make sure that 'inference.inputs()' are all in
    'training.all_outputs() + training.inputs()' except 'input_name' and parameters
    which starts with 'params:'
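
A hedged reading of that check, written out as code (this is not kedro-mlflow's actual implementation, just the rule the message states): every inference input other than input_name and parameters must already appear among the training pipeline's outputs or inputs, otherwise it is reported as free.

    def free_inference_inputs(training, inference, input_name):
        """Illustrative only: the datasets the error above would list as free."""
        allowed = training.all_outputs() | training.inputs()
        return {
            name
            for name in inference.inputs()
            if name != input_name
            and name != "parameters"
            and not name.startswith("params:")
            and name not in allowed
        }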

The pipeline code is:

from kedro.pipeline import Pipeline, node
from poc_kedro_azureml_mlflow.pipelines.etl_ml_app.nodes import (
    split_data,
    scale_data_fit,
    scale_data_transform,
    train_rf_model,
    predict,
)

def create_pipeline(**kwargs) -> Pipeline:
    training_pipeline = Pipeline(
        [
            node(
                split_data,
                ["iris_data", "parameters"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                tags=["training", "etl_app"],
                name="split_data",
            ),
            node(
                scale_data_fit,
                "X_train",
                outputs=["X_train_scaled", "scaler"],
                tags=["training"],
                name="scale_data_fit",
            ),
            node(
                train_rf_model,
                ["X_train_scaled", "y_train"],
                "rf_model",
                tags="training",
                name="training_rf_model",
            ),
        ],
    )
    inference_pipeline = Pipeline(
        [
            node(
                scale_data_transform,
                ["X_test", "scaler"],
                outputs="X_test_scaled",
                tags=["inference"],
                name="scale_data_transform",
            ),
            node(
                predict,
                ["X_test_scaled", "rf_model"],
                "rf_predictions",
                tags="inference",
                name="predict_rf_model",
            ),
        ]
    )

    return training_pipeline + inference_pipeline
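
As a quick local sanity check (a sketch that assumes the create_pipeline above is importable as-is), one can print which datasets the tag-filtered sub-pipelines expose, since the error above concerns the inference pipeline's free inputs:

    pipeline = create_pipeline()
    training = pipeline.only_nodes_with_tags("training")
    inference = pipeline.only_nodes_with_tags("inference")

    print(sorted(inference.inputs()))      # expect X_test, rf_model and scaler
    print(sorted(training.all_outputs()))  # rf_model and scaler should appear here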

marrrcin commented 1 year ago

It's specific to kedro-mlflow; any hints, @Galileo-Galilei?

> It seems that the issue comes from the fact that kedro-azureml decomposes the Kedro nodes into Azure ML nodes and the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline.

As for that: we indeed split the Kedro nodes into Azure ML nodes, but I don't understand the "are not shared between training and inference" part. Data is shared via Kedro's Data Catalog, so when any node needs to load something, it goes to the Data Catalog. While running on Azure ML, if the entry is missing from the catalog, our plugin automatically loads the data from the temporary storage set in azureml.yml (https://github.com/getindata/kedro-azureml/blob/a040b3c65c57a38cf6a64ca0d0792e471abe6911/kedro_azureml/config.py#L95). If you've opted in to the preview feature pipeline_data_passing, the data will be passed via Azure ML-mounted files.

Maybe it's a problem in kedro-mlflow, in that it cannot recognize that the data is passed implicitly. Have you tried explicitly defining your inputs/outputs (e.g. scaler, rf_model, etc.) in the catalog? If they are defined, there should be no issue with loading them from any node.
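
For illustration, a minimal sketch of declaring those entries explicitly through the Python API (a YAML catalog entry with the same type and filepath is equivalent); the dataset types and paths below are assumptions, not taken from the project:

    from kedro.io import DataCatalog
    from kedro_datasets.pickle import PickleDataSet

    catalog = DataCatalog(
        {
            "scaler": PickleDataSet(filepath="data/06_models/scaler.pkl"),
            "rf_model": PickleDataSet(filepath="data/06_models/rf_model.pkl"),
        }
    )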

Galileo-Galilei commented 1 year ago

Hmm, I'll take a deep dive into the code in the coming days, but I already have some comments:

  1. Your kedro-viz graph should likely not work "as is" with plain kedro-mlflow, so I am a bit confused: your scaler and RF model are (as the error message says) NOT inputs of your predict_with_mlflow function, while they should be (how can you predict without the model?).
  2. The pipeline_ml_factory "magic" comes from a Kedro hook which retrieves the artifacts at the end of the pipeline; if the Kedro nodes are converted to Azure ML nodes, you lose the benefits of Kedro hooks and there is no reason it would work out of the box (see the sketch below).
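
A simplified sketch of the mechanism point 2 refers to (hedged: the class and names below are illustrative, not kedro-mlflow's actual code). The model logging happens in an after_pipeline_run hook, so it only helps when the whole Kedro pipeline runs in one process:

    from kedro.framework.hooks import hook_impl

    class LogModelAtEndOfRunHook:
        @hook_impl
        def after_pipeline_run(self, run_params, pipeline, catalog):
            # Conceptually, this is where kedro-mlflow logs the inference model
            # and its artifacts. When each Azure ML step runs only a slice of
            # the pipeline, this end-of-run logic never sees the complete
            # training + inference graph it needs.
            ...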

jpoullet2000 commented 1 year ago

The Pipeline Inference Model contains both the scaler and the RF model and is generated by pipeline_ml_factory. I also have the feeling that Kedro hooks at the pipeline level are not usable with kedro-azureml. @marrrcin, can you confirm?