Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

Support summation for `PipelineML` #554

Open mck-star-yar opened 1 month ago

mck-star-yar commented 1 month ago

Description

This feature would simplify pipeline ensembling a lot if `PipelineML` implemented `__add__` and the other relevant dunder methods.

Here's a simplified example. We have a ds pipeline defined as data_prep + training + report_training, where the actual model training happens in training. All four pipelines are exposed in the pipeline registry, so users can trigger a training run by calling either ds or training. Since I can't add the result of pipeline_ml_factory to other pipelines when defining ds, I have to wrap both ds and training with pipeline_ml_factory:

Current code:

from kedro_mlflow.pipeline import pipeline_ml_factory

data_prep = ...
training = ...
inference = ...
training_ml = pipeline_ml_factory(
    training_pipeline=training,
    inference=inference,
    input_name="feature_table",
    log_model_kwargs={
        "artifact_path": "some/path/to/artifact",
        "registered_model_name": "some_model_name",
    }
)
report_training = ...

ds = data_prep + training + report_training
ds_ml = pipeline_ml_factory(
    training_pipeline=ds,
    inference=inference,
    input_name="feature_table",
    log_model_kwargs={
        "artifact_path": "some/path/to/artifact",
        "registered_model_name": "some_model_name",
    }
)
pipeline_registry = {
  'ds': ds_ml,
  'training': training_ml,
  'data_prep': data_prep,
  'report_training': report_training,
}

Desired code:

data_prep = ...
training = ...
inference = ...
training = pipeline_ml_factory(
    training_pipeline=training,
    inference=inference,
    input_name="feature_table",
    log_model_kwargs={
        "artifact_path": "some/path/to/artifact",
        "registered_model_name": "some_model_name",
    }
)
report_training = ...
ds = data_prep + training + report_training

pipeline_registry = {
  'ds': ds,
  'training': training,
  'data_prep': data_prep,
  'report_training': report_training,
}

Of course, the alternative is to extract this wrapping code and apply it to every pipeline that contains the training step; but that would require manual tracking each time a pipeline is updated, as opposed to wrapping only the training step once.
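
For reference, that workaround would look roughly like the sketch below. wrap_with_ml is a hypothetical helper (not part of kedro-mlflow), and the keyword arguments simply reuse the ones from the example above:

from kedro_mlflow.pipeline import pipeline_ml_factory

def wrap_with_ml(training_pipeline):
    # Hypothetical helper: re-applies the same pipeline_ml_factory wrapping
    # to every pipeline that contains the training step.
    return pipeline_ml_factory(
        training_pipeline=training_pipeline,
        inference=inference,
        input_name="feature_table",
        log_model_kwargs={
            "artifact_path": "some/path/to/artifact",
            "registered_model_name": "some_model_name",
        },
    )

training_ml = wrap_with_ml(training)
ds_ml = wrap_with_ml(data_prep + training + report_training)

Every new pipeline that embeds the training step would need its own wrap_with_ml call, which is exactly the manual tracking mentioned above.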

I can see that this behavior is by design. Is there a specific reason why this limitation is imposed?

Possible Implementation

Implement the relevant dunder methods (e.g. `__add__`) on `PipelineML`.
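
A minimal sketch of what this could look like, assuming PipelineML exposes inference, input_name and log_model_kwargs as attributes and that the imports and keyword arguments below match the examples above (these names and import paths are assumptions, not a confirmed internal API):

from kedro.pipeline import Pipeline
from kedro_mlflow.pipeline import pipeline_ml_factory
from kedro_mlflow.pipeline.pipeline_ml import PipelineML

class SummablePipelineML(PipelineML):
    # Illustrative subclass only; the real change would live in PipelineML itself.
    def __add__(self, other):
        if isinstance(other, PipelineML):
            # Ambiguous: which inference pipeline should the sum keep?
            raise TypeError("Cannot add two PipelineML objects together")
        # Extend the training part with the plain Pipeline and re-wrap,
        # keeping the inference pipeline and logging config untouched.
        return pipeline_ml_factory(
            training_pipeline=Pipeline(self.nodes) + other,
            inference=self.inference,
            input_name=self.input_name,
            log_model_kwargs=self.log_model_kwargs,
        )

Even with this, left-addition (data_prep + training_ml) and chained sums would silently fall back to Pipeline.__add__ and return a plain Pipeline unless the reflected operators are handled too, which is part of what makes this harder than it looks.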

Galileo-Galilei commented 1 month ago

I can see the value of this, but this is a very difficult problem, mostly because:

summing / filtering pipelines breaks the assumptions that the inference pipeline has a single free input (input_name) and that all its other inputs are outputs of training. You need to know exactly what you are doing to ensure the summed pipeline is still a valid PipelineML.

I could possibly allow summation and raise an error in case of an invalid result, but I am afraid this would raise even more questions...
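
To make the constraint concrete, the invariant a summed pipeline would have to keep is roughly the following (a sketch of the idea only, ignoring parameters and catalog persistence, and not the actual check kedro-mlflow performs):

def check_ml_consistency(training, inference, input_name):
    # Every free input of the inference pipeline, except the data fed at
    # prediction time (input_name), must be produced by the training pipeline.
    missing = inference.inputs() - {input_name} - training.all_outputs()
    if missing:
        raise ValueError(
            f"Invalid PipelineML: inference inputs {missing} are not "
            "outputs of the training pipeline"
        )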

mck-star-yar commented 1 month ago

Yeah, that makes perfect sense. I think we should leave it as it is. You could also mention this discussion in the code itself to document the design decision.

Galileo-Galilei commented 1 month ago

I've played with it a little, and it's possible to add a PipelineML and a plain Pipeline, but it does not make sense to add two PipelineML objects together (we can add the training pipelines, but what about the inference pipelines?).
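
In other words (purely illustrative; other_training_ml stands for any second PipelineML):

training_ml + report_training    # fine in principle: only extends the training part
training_ml + other_training_ml  # ambiguous: which inference pipeline should survive?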

As discussed, I'll just document the decision.