Allow mlflow hooks to be overwritten, and more choice on what to log

Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)

Apache License 2.0


Description

I believe more options within the mlflow.yml file would be helpful. I refer to dicts within params here, but it could apply across the board.

1) 'Skip/ignore dict' option, e.g,

Context

We need versatility, and don't want to clutter the 'params' section of our mlflow experiments with endless dicts and other values. Or maybe we simply don't want to log them: they don't change often, and we can use a timestamp to find out what they were anyway. Why would we want to store them on every single run?

Hi @Joenetics, sorry for the long delay. You are raising some very good points, and I'll try to answer them all.

Allow us to choose what to save, where to save, when to save,...

Actually, kedro-mlflow is deliberately quite opinionated. The goal is to log by default everything that must be logged to ensure reproducibility, and not to make it easy to skip that logging. I acknowledge that I should allow some flexibility to handle special cases.

Here is what I plan to do:

'Skip/ignore dict' option

As described in #441, if you think a dict should not be logged, it means it does not really contain parameters. Hence, you can avoid logging it by converting it to a YAML file and loading it directly from the catalog as a dataset:

# data/01_raw/your_dict_parameter.yaml

key1:
    key11: a
    key12: b
key2:
    key21: 
        key 211: ca
        key 222: cb
    key22: d

# catalog.yaml
your_dict: 
    type: yaml.YAMLDataSet
    filepath: data/01_raw/your_dict_parameter.yaml

Then replace params:your_dict with your_dict everywhere in your pipeline, and everything will work exactly the same without logging this file.

Decision: I won't add this key since it is not a best practice and a clean workaround is available. I should document the workaround.
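
To give a sense of the clutter being discussed, here is a minimal sketch of how the nested dict above could expand if every leaf were logged as its own parameter. The `flatten_dict` helper and the dot separator are assumptions made for illustration; this is not kedro-mlflow's own code.

```python
from typing import Any, Dict


def flatten_dict(d: Dict[str, Any], sep: str = ".", prefix: str = "") -> Dict[str, Any]:
    """Recursively flatten a nested dict into {"a.b.c": leaf} pairs.

    Hypothetical helper: the name and the separator are illustrative only.
    """
    flat: Dict[str, Any] = {}
    for key, value in d.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Descend into nested dicts, carrying the accumulated key path.
            flat.update(flatten_dict(value, sep=sep, prefix=full_key))
        else:
            flat[full_key] = value
    return flat


# Same content as data/01_raw/your_dict_parameter.yaml above.
your_dict = {
    "key1": {"key11": "a", "key12": "b"},
    "key2": {"key21": {"key 211": "ca", "key 222": "cb"}, "key22": "d"},
}

flat_params = flatten_dict(your_dict)
for name, value in flat_params.items():
    print(f"{name} = {value}")
```

Even this small dict yields five separate entries, which is why moving it to the catalog keeps the 'params' section readable.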

'log as' artifact

Actually, there is a more general open question: "How can I log any arbitrary artifact which is an input of the pipeline?" There are related questions on Slack and an issue about this (#446). In this situation, I could offer a helper to log a dataset from the catalog, like:

your_dict: 
    type: kedro_mlflow.io.artifacts.MlflowInputDataSet # does not exist yet, same syntax as MlflowArtifactDataset
    data_set:
      type: yaml.YAMLDataSet
      filepath: data/01_raw/your_dict_parameter.yaml

Decision: I'll introduce the more general feature

Overwrite default `kedro_mlflow` hooks to log extra features

Decision: I need to document this feature.
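
Until that documentation exists, the general pattern could look like the sketch below: subclass the hook that logs parameters and filter what it receives before delegating to the parent. `DefaultParamsHook` is a stand-in defined here so the example runs on its own; it is not a kedro-mlflow class, its `before_node_run` signature is simplified, and `learning_rate` is a made-up parameter. The real hook names, signatures and registration steps should be taken from the plugin's source.

```python
from typing import Any, Dict, Iterable

PREFIX = "params:"


class DefaultParamsHook:
    """Stand-in for a hook that logs every `params:` input of a node.

    Hypothetical class for illustration only, NOT part of kedro-mlflow.
    """

    def __init__(self) -> None:
        # A plain dict stands in for the mlflow run that would receive params.
        self.logged: Dict[str, Any] = {}

    def before_node_run(self, inputs: Dict[str, Any]) -> None:
        for name, value in inputs.items():
            if name.startswith(PREFIX):
                self.logged[name[len(PREFIX):]] = value


class SelectiveParamsHook(DefaultParamsHook):
    """Overrides the default behaviour to skip chosen parameters."""

    def __init__(self, skip: Iterable[str]) -> None:
        super().__init__()
        self.skip = set(skip)

    def before_node_run(self, inputs: Dict[str, Any]) -> None:
        # Drop the parameters listed in `skip`, then reuse the parent logic.
        kept = {
            name: value
            for name, value in inputs.items()
            if not (name.startswith(PREFIX) and name[len(PREFIX):] in self.skip)
        }
        super().before_node_run(kept)


hook = SelectiveParamsHook(skip=["your_dict"])
hook.before_node_run(
    {
        "params:your_dict": {"key1": "a"},
        "params:learning_rate": 0.1,  # made-up parameter for the example
        "raw_data": [1, 2, 3],  # not a parameter, so never logged
    }
)
print(hook.logged)
```

Delegating to `super()` keeps the default logging intact for everything that is not explicitly skipped, which fits the plugin's stated goal of logging by default.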

Allow users to choose truncation length

Decision: I won't fix it myself, but I'll accept a PR.
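
For anyone picking this up, the core of such a PR is probably small. A minimal sketch, assuming a hypothetical `truncate_param` helper and a user-supplied `max_length`; the default of 250 below is only a placeholder, and the actual limit depends on your MLflow version.

```python
def truncate_param(value: object, max_length: int = 250) -> str:
    """Return str(value), cut to at most `max_length` characters.

    Hypothetical helper: neither the name nor the default comes from kedro-mlflow.
    """
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    text = str(value)
    return text if len(text) <= max_length else text[:max_length]


long_value = {"key": "x" * 300}
short = truncate_param(long_value, max_length=20)
print(short, len(short))
```

In the plugin, `max_length` would presumably be read from mlflow.yml, so that users can set it per project.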
