Closed: sebastiandro closed this issue 5 months ago
Glad to see you're using `pipeline_ml_factory`, this is an underestimated feature of the plugin which is not well known, I guess ;)
Good catch, I've seen a bunch of issues like this since the release of dataset factories. I think we can just call `dataset.exists()` for each dataset to force the catalog to materialize them, which should make your code much simpler, something like this:
https://github.com/takikadiri/kedro-boot/commit/ac758fce16ad81cb01b6ac5b549c104100a13bb0
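The idea behind this suggestion can be illustrated with a minimal stand-in (this is NOT the real kedro `DataCatalog`; the class and names below are invented for illustration): touching every dataset name a pipeline uses via `exists()` makes a pattern-based catalog materialize concrete entries as a side effect.

```python
# Minimal stand-in for illustration only: NOT the real kedro DataCatalog.
# It shows why calling exists() on every dataset name used by a pipeline
# forces a catalog with dataset-factory patterns to materialize concrete
# entries as a side effect.
from fnmatch import fnmatch


class TinyCatalog:
    def __init__(self, patterns):
        self._patterns = patterns  # pattern -> dataset config
        self._datasets = {}        # concrete, materialized entries

    def exists(self, name):
        # Touching an unseen name that matches a pattern materializes it,
        # mimicking kedro's lazy dataset-factory resolution.
        if name not in self._datasets:
            for pattern, config in self._patterns.items():
                if fnmatch(name, pattern):
                    self._datasets[name] = dict(config)
                    break
        return name in self._datasets


catalog = TinyCatalog({"*.model": {"type": "pickle.PickleDataset"}})
for ds_name in ["dataset_a.model", "dataset_b.model"]:
    catalog.exists(ds_name)  # force materialization

print(sorted(catalog._datasets))  # ['dataset_a.model', 'dataset_b.model']
```

In the real plugin the same loop would run over the pipeline's dataset names against the actual catalog, as the linked commit does.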
I'll try to push a fix soon, but PRs are welcome!
Thank you for that suggestion, @Galileo-Galilei; that seemed to do the trick! I opened a PR: https://github.com/Galileo-Galilei/kedro-mlflow/pull/519, but unfortunately, I missed linking it to this issue and setting you as a reviewer 🙈 I can't seem to edit it now after opening it.
Description
Hello there! First off, I want to thank you for this great plugin :)
The problem
I'm using modular pipelines and dataset factories. When using `pipeline_ml_factory` to deploy pipelines to MLflow, I've encountered issues with kedro-mlflow not recognizing catalogue entries with factory patterns.

Use case example:
Let's say we have a model pipeline that we want to run on different datasets. To differentiate the datasets, we use namespaces.
If we have the following setup:
Our catalogue contains the following entries to make sure we are persisting the data and model separated by namespace:
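The original entries did not survive in this thread; a hypothetical example of namespaced factory-pattern entries (dataset types and file paths are placeholders of my own choosing, not the actual project's) could look like:

```yaml
# Hypothetical catalogue entries, for illustration only
"{namespace}.model_input":
  type: pandas.ParquetDataset
  filepath: data/05_model_input/{namespace}/model_input.parquet

"{namespace}.model":
  type: pickle.PickleDataset
  filepath: data/06_models/{namespace}/model.pkl
```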
If we set up the `PipelineML` instance for pipeline a:
When running the pipeline, kedro-mlflow does not recognise `dataset_a.model` as an artefact that should be uploaded. It instead throws an error:

It works if I add the full name to the catalogue:
But then I lose the benefits of the nice naming patterns :)
Context
This is important since I am working on a project where we re-use the model pipelines across many datasets. We separate the datasets using namespaces. Hence, it would greatly help to use the naming patterns to reduce the work every time a new dataset is added.
Possible Implementation
This line raises the exception: https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/mlflow/kedro_pipeline_model.py#L119
It turns out that the `DataCatalog` passed to the `KedroPipelineModel` constructor in the `after_pipeline_run` hook does not contain catalogue entries with factory patterns: https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0033c5072c929a4c26cfaeaf61fcedf93d36522/kedro_mlflow/framework/hooks/mlflow_hook.py#L353C1-L366

My workaround for now is to update the `DataCatalog` to resolve the factory patterns before sending it to `KedroPipelineModel`:

I took this approach from `kedro catalog resolve`. I add a helper function to `mlflow_hooks.py`:

And use it like this in the `after_pipeline_run` hook:

Could this be a potential solution to the problem? Or is there a simpler way that I have totally missed :)
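The helper and hook code referenced above were not preserved in this thread. As a rough stdlib-only sketch of the underlying idea (placeholder names, not the actual helper from the PR): expand `{namespace}`-style dataset-factory patterns into explicit catalogue entries for every dataset name a pipeline uses, mimicking what `kedro catalog resolve` prints.

```python
# Rough stdlib-only sketch (placeholder names; NOT the actual helper from
# the PR): expand "{namespace}"-style dataset-factory patterns into explicit
# catalogue entries, mimicking `kedro catalog resolve`.
import re


def _pattern_to_regex(pattern):
    # "{namespace}.model" -> regex r"(?P<namespace>[^.]+)\.model$"
    parts = re.split(r"(\{\w+\})", pattern)
    body = "".join(
        "(?P<%s>[^.]+)" % part[1:-1] if part.startswith("{") else re.escape(part)
        for part in parts
    )
    return re.compile(body + "$")


def resolve_patterns(patterns, dataset_names):
    """Return an explicit {dataset_name: config} mapping."""
    resolved = {}
    for name in dataset_names:
        for pattern, config in patterns.items():
            match = _pattern_to_regex(pattern).match(name)
            if match:
                # Substitute the captured placeholders into the config values.
                resolved[name] = {
                    key: value.format(**match.groupdict())
                    for key, value in config.items()
                }
                break
    return resolved


patterns = {
    "{namespace}.model": {
        "type": "pickle.PickleDataset",
        "filepath": "data/06_models/{namespace}/model.pkl",
    }
}
resolved = resolve_patterns(patterns, ["dataset_a.model"])
print(resolved["dataset_a.model"]["filepath"])  # data/06_models/dataset_a/model.pkl
```

A real implementation would instead reuse the catalog's own pattern-matching machinery (as the maintainer's `dataset.exists()` suggestion does), rather than re-implementing the matching by hand.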