Open Joenetics opened 11 months ago
Hi @Joenetics, sorry for the long delay. You are raising some very good points, and I'll try to answer them all.
Allow us to choose what to save, where to save, when to save,...
Actually kedro-mlflow
is voluntarily quite opinionated. The goal is to log by default what must be logged to ensure reproducibility and not make easy to avoid it. I acknowledge that I should let some flexibility to handle special cases.
Here is what I plan to do :
As described in #441, if you think a dict should not be logged, it means it does not really contain parameters. Hence, you can avoid logging by converting it to a yaml file and load it directly in the catalog as a dataset:
# data/01_raw/your_dict_parameter.yaml
key1:
key11: a
key12: b
key2:
key21:
key 211: ca
key 222: cb
key22: d
#catalog.yaml
your_dict:
type: yaml.YAMLDataSet
filepath: data/01_raw/your_dict_parameter.yaml
Then replace params:your_dict
by your_dict
everywhere in your pipeline and everything will work exactly the same without logging this file.
Decision: I won't add this key since it is not a best practice and a clean workaround is available. I should document the workaround.
Actually there is a more general open question which is "How can i enable ot log any arbitrary artifact which is an input of the pipeline". There are related question on slack and an issue about this #446. In this situation, I could offer a helper to log a dataset in the catalog like:
your_dict:
type: kedro_mlflow.io.artifacts.MlflowInputDataSet # does not exists yet, same syntatx as MlflowArtifactDataset
data_set:
type: yaml.YAMLDataSet
filepath: data/01_raw/your_dict_parameter.yaml
Decision: I'll introduce the more general feature
kedro_mlflow
hooks to log extra featureDecision: I need to document this feature.
Decision: I won't fix it myself, but I'll accept PR
Description
I believe more options within the mlflow.yml file would be helpful. I refer to dicts within params here, but it could apply across the board.
Allow us to choose what to save, where to save, when to save,... Also related to bug https://github.com/Galileo-Galilei/kedro-mlflow/issues/441
1) 'Skip/ignore dict' option, e.g,
This would ignore all dicts instead of logging them as params
params: dict_params: skip: ignore 2) 'log as'
allow us to choose where to dump dicts
params: dict_params: log_as: 'artifact' path: 'dict_params/' 3) allow us to overwrite default kedro_mlflow hooks by using hooks.py file
4) allow users to choose how much to trancate, e.g truncate any params over length 200 to length 20.
code within mlflow.yml
params: dict_params: flatten: False
recursive: True
sep: "." long_params_strategy: truncate long_params_truncate_at: 200 long_params_truncate_to: 20
Context
We need versatility, and don't want to clutter the 'params' section of our mlflow experiments with endless dicts/other values. Or maybe we simply don't want to log them, as they don't change often and can use a timestamp to find out what they were anyway. Why would we want to store them every single run?
Possible Implementation
https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/framework/hooks/mlflow_hook.py This class could have the options implemented from the mlflow.py with a few more if-else statements