jpoullet2000 opened this issue 1 year ago
If adding _filepath helps, then no problem. We're open to PRs :)
@Galileo-Galilei any other suggestions?
Added _filepath in
https://github.com/getindata/kedro-azureml/blob/2e5836b72256d7455d8525c7769a68d4c844ccf7/kedro_azureml/datasets/pipeline_dataset.py#L100
It's already released in 0.4.0. @jpoullet2000 please let me know if it fixes the problem.
FYI @tomasvanpottelbergh
I'm currently out of office. I'll come back to you in 2 weeks.
Sorry for the late reply, I was on holiday too. Just to understand: what is this dataset intended to do?
Actually, kedro-mlflow should only check the filepath for the datasets it needs to use as mlflow artifacts. So either this is a bug (kedro-mlflow checks the filepath on a dataset it should not), or this solution won't work (kedro-mlflow won't complain, but if there is no data at the given filepath, it will not be able to log it in mlflow or to fetch it at inference time). What does your pipeline look like? What are you trying to do?
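To make the point about the filepath concrete, here is a rough sketch (not kedro-mlflow's actual code, and the helper name is made up): an artifact can only be logged if real data sits at the dataset's _filepath.

from pathlib import Path

import mlflow


def log_dataset_as_artifact(dataset, artifact_path="artifacts"):
    # The dataset is expected to expose the _filepath attribute discussed in
    # this issue; without real data at that path, mlflow has nothing to log.
    local_path = Path(str(getattr(dataset, "_filepath")))
    if not local_path.exists():
        raise FileNotFoundError(f"No data at {local_path}, cannot log it to mlflow")
    mlflow.log_artifact(str(local_path), artifact_path=artifact_path)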
Hi. Sorry for the late reply. The goal is to store an mlflow pipeline while running an azureml pipeline that wraps a kedro pipeline, and I'd like to use the pipeline_ml_factory method for that. It seems that the issue comes from the fact that kedro-azureml decomposes the kedro nodes into azureml nodes, so the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline. That's the reason why I wanted to use the AzureMLPipelineDataSet, which should pass the data from one node to the other. But I'm still not convinced that it solves the issue (still testing).
As an illustration, here is a simple pipeline (view from kedro-viz):
When I try to run the pipeline etl_ml_pipeline corresponding to the code:
etl_ml_pipeline = create_etl_ml_pipeline()
inference_pipeline_etl_ml = etl_ml_pipeline.only_nodes_with_tags("inference")
training_pipeline_etl_ml = pipeline_ml_factory(
    training=etl_ml_pipeline.only_nodes_with_tags("training"),
    inference=inference_pipeline_etl_ml,
    input_name="X_test",
    log_model_kwargs=dict(
        artifact_path="poc_kedro_azureml_mlflow",
        # conda_env="src/requirements.txt",
        conda_env={
            "python": python_version(),
            "build_dependencies": ["pip"],
            "dependencies": [
                f"poc_kedro_azureml_mlflow=={PROJECT_VERSION}",
                {"pip": dependencies},
            ],
        },
        signature="auto",
    ),
)
I get the following error:

KedroMlflowPipelineMLError: The following inputs are free for the inference pipeline:
- scaler
- rf_model.
No free input is allowed. Please make sure that 'inference.inputs()' are all in 'training.all_outputs() + training.inputs()' except 'input_name' and parameters which starts with 'params:'
The pipeline code is:

from kedro.pipeline import Pipeline, node

from poc_kedro_azureml_mlflow.pipelines.etl_ml_app.nodes import (
    split_data,
    scale_data_fit,
    scale_data_transform,
    train_rf_model,
    predict,
)


def create_pipeline(**kwargs) -> Pipeline:
    training_pipeline = Pipeline(
        [
            node(
                split_data,
                ["iris_data", "parameters"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                tags=["training", "etl_app"],
                name="split_data",
            ),
            node(
                scale_data_fit,
                "X_train",
                outputs=["X_train_scaled", "scaler"],
                tags=["training"],
                name="scale_data_fit",
            ),
            node(
                train_rf_model,
                ["X_train_scaled", "y_train"],
                "rf_model",
                tags="training",
                name="training_rf_model",
            ),
        ],
    )
    inference_pipeline = Pipeline(
        [
            node(
                scale_data_transform,
                ["X_test", "scaler"],
                outputs="X_test_scaled",
                tags=["inference"],
                name="scale_data_transform",
            ),
            node(
                predict,
                ["X_test_scaled", "rf_model"],
                "rf_predictions",
                tags="inference",
                name="predict_rf_model",
            ),
        ]
    )
    return training_pipeline + inference_pipeline
It's specific to kedro-mlflow, any hints @Galileo-Galilei?
"It seems that the issue comes from the fact that kedro-azureml decomposes the kedro nodes into azureml nodes, so the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline."
As for that - we indeed split the kedro nodes into Azure ML nodes, but I don't understand the "are not shared between training and inference" part. Data is shared via Kedro's Data Catalog, so when any node needs to load something, it goes to the Data Catalog. While running on Azure ML, if the entry is missing from the catalog, our plugin automatically loads the data from the temporary storage set in azureml.yml:
https://github.com/getindata/kedro-azureml/blob/a040b3c65c57a38cf6a64ca0d0792e471abe6911/kedro_azureml/config.py#L95
If you've opted in to the preview feature pipeline_data_passing, then the data will be passed via Azure ML-mounted files.
Maybe it's a problem in kedro-mlflow, in that it cannot recognize that the data is passed implicitly. Have you tried explicitly defining your inputs/outputs (e.g. scaler, rf_model, etc.) in the catalog? If they are defined, then there should be no issue with loading them from any node.
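For illustration, something like the following (a rough, programmatic equivalent of explicit catalog.yml entries; the paths and dataset classes are just assumptions) would make scaler and rf_model persisted datasets that any node, and hence any Azure ML step, can resolve by name:

from kedro.io import DataCatalog
from kedro_datasets.pickle import PickleDataSet  # kedro.extras.datasets.pickle on older kedro versions

catalog = DataCatalog(
    {
        # Persisted to disk instead of kept in memory, so every step can load them by name
        "scaler": PickleDataSet(filepath="data/06_models/scaler.pkl"),
        "rf_model": PickleDataSet(filepath="data/06_models/rf_model.pkl"),
    }
)

The same idea applies if you declare them with AzureMLPipelineDataSet instead of a plain pickle dataset.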
Hum, I'll have a deep dive into the code in the coming days, but I already have some comments:
- Your kedro-viz graph should likely not work "as is" with plain kedro-mlflow, so I am a bit confused: your scaler and RF models are (as the error message says) NOT inputs of your predict_with_mlflow function, while they should be (how can you predict without the model?).
- The pipeline_ml_factory "magic" comes from a kedro hook which retrieves the artifacts at the end of the pipeline; if the kedro nodes are converted to azure nodes, you lose the benefits of kedro hooks and there is no reason that it will work out of the box. The Pipeline Inference Model contains both the scaler and the RF model and is generated by pipeline_ml_factory.
I also have the feeling that the kedro hooks at the pipeline level are not usable with kedro-azureml. @marrrcin, can you confirm?
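To illustrate that point with a sketch (only the idea, not kedro-mlflow's actual implementation; the hook signature may differ between kedro versions): the artifacts are collected in an after_pipeline_run hook, which only fires when the whole kedro pipeline runs in one process. Once every node becomes its own Azure ML step, no step ever reaches "the end of the pipeline".

import mlflow
from kedro.framework.hooks import hook_impl


class LogInferenceArtifactsHook:
    @hook_impl
    def after_pipeline_run(self, run_params, pipeline, catalog):
        # Fires once, at the end of a single kedro run; if every node runs as
        # its own Azure ML step, this moment never happens in any step.
        for path in ("data/06_models/scaler.pkl", "data/06_models/rf_model.pkl"):  # illustrative paths
            mlflow.log_artifact(path)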
The pipeline_ml_factory method in kedro-mlflow is a useful method to store artifacts (transformers, models) automatically (using a kedro-mlflow hook). However, this method calls extract_pipeline_artifacts, which requires the _filepath attribute to be available (see here). The AzureMLPipelineDataSet class does not provide this attribute. Wouldn't it be possible to add it to the class attributes? Do you have any other suggestion to store the mlflow pipeline?
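For reference, a hypothetical sketch of what exposing such an attribute on a wrapper dataset could look like (illustrative names only, not the actual kedro-azureml implementation):

from pathlib import Path

from kedro.io import AbstractDataSet


class WrappedPipelineDataSet(AbstractDataSet):
    """Illustrative wrapper that delegates I/O to an underlying dataset while
    exposing the _filepath attribute that kedro-mlflow looks for."""

    def __init__(self, dataset: AbstractDataSet, filepath: str):
        self._dataset = dataset          # does the actual load/save at `filepath`
        self._filepath = Path(filepath)  # what extract_pipeline_artifacts reads

    def _load(self):
        return self._dataset.load()

    def _save(self, data):
        self._dataset.save(data)

    def _describe(self):
        return {"filepath": str(self._filepath)}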