Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0
195 stars 29 forks source link

Allow mlflow hooks to be overwriten, and more choice on what to log #442

Open Joenetics opened 11 months ago

Joenetics commented 11 months ago

Description

I believe more options within the mlflow.yml file would be helpful. I refer to dicts within params here, but it could apply across the board.

Allow us to choose what to save, where to save, when to save,... Also related to bug https://github.com/Galileo-Galilei/kedro-mlflow/issues/441

1) 'Skip/ignore dict' option, e.g,

This would ignore all dicts instead of logging them as params

params: dict_params: skip: ignore 2) 'log as'

allow us to choose where to dump dicts

params: dict_params: log_as: 'artifact' path: 'dict_params/' 3) allow us to overwrite default kedro_mlflow hooks by using hooks.py file

4) allow users to choose how much to trancate, e.g truncate any params over length 200 to length 20.

code within mlflow.yml

params: dict_params: flatten: False
recursive: True
sep: "." long_params_strategy: truncate long_params_truncate_at: 200 long_params_truncate_to: 20

Context

We need versatility, and don't want to clutter the 'params' section of our mlflow experiments with endless dicts/other values. Or maybe we simply don't want to log them, as they don't change often and can use a timestamp to find out what they were anyway. Why would we want to store them every single run?

Possible Implementation

https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/framework/hooks/mlflow_hook.py This class could have the options implemented from the mlflow.py with a few more if-else statements

Galileo-Galilei commented 11 months ago

Hi @Joenetics, sorry for the long delay. You are raising some very good points, and I'll try to answer them all.

Allow us to choose what to save, where to save, when to save,...

Actually kedro-mlflow is voluntarily quite opinionated. The goal is to log by default what must be logged to ensure reproducibility and not make easy to avoid it. I acknowledge that I should let some flexibility to handle special cases.

Here is what I plan to do :

'Skip/ignore dict' option

As described in #441, if you think a dict should not be logged, it means it does not really contain parameters. Hence, you can avoid logging by converting it to a yaml file and load it directly in the catalog as a dataset:

# data/01_raw/your_dict_parameter.yaml

 key1:
    key11: a
    key12: b
key2:
    key21: 
        key 211: ca
        key 222: cb
    key22: d
#catalog.yaml
your_dict: 
    type: yaml.YAMLDataSet
    filepath: data/01_raw/your_dict_parameter.yaml

Then replace params:your_dict by your_dict everywhere in your pipeline and everything will work exactly the same without logging this file.

Decision: I won't add this key since it is not a best practice and a clean workaround is available. I should document the workaround.

'log as' artifact

Actually there is a more general open question which is "How can i enable ot log any arbitrary artifact which is an input of the pipeline". There are related question on slack and an issue about this #446. In this situation, I could offer a helper to log a dataset in the catalog like:

your_dict: 
    type: kedro_mlflow.io.artifacts.MlflowInputDataSet # does not exists yet, same syntatx as MlflowArtifactDataset
    data_set:
      type: yaml.YAMLDataSet
      filepath: data/01_raw/your_dict_parameter.yaml

Decision: I'll introduce the more general feature

Overwrite default kedro_mlflow hooks to log extra feature

Decision: I need to document this feature.

Allow users to choose truncation length

Decision: I won't fix it myself, but I'll accept PR