Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0
195 stars 29 forks source link

Log only passed CLI arguments and options as tags #32

Closed turn1a closed 3 years ago

turn1a commented 3 years ago

Currently, all possible CLI arguments and options are logged as tags even if they have no values (they weren't specified). I think this is unnecessary and we should only set the tags for the passed arguments and options. This would make tracking more readable. @Galileo-Galilei what do you think?

Galileo-Galilei commented 3 years ago

I though about it at first, but I had some troubles with it because the run_params are not properly recorded in a class, so wecannot access to their defaults value and access only the invoked ones. Kedro has a @fixme in the code for hooks implementation...

https://github.com/Galileo-Galilei/kedro-mlflow/blob/28e60e035b3f3f91e6cd9f9aabdec181a5b289ca/kedro_mlflow/framework/hooks/pipeline_hook.py#L28-L49

If we want to keep only passed arguments, we need to:

  1. parse the dict,
  2. compare to the defaults
  3. keep only those differents from the defaults. The problem here is that default values of all CLI args must be hardcoded inside kedro-mlflow to compare them to passed values (that is what I do to generate the kedro command https://github.com/Galileo-Galilei/kedro-mlflow/blob/28e60e035b3f3f91e6cd9f9aabdec181a5b289ca/kedro_mlflow/framework/hooks/pipeline_hook.py#L130-L154, but I will improve it a part of #11). This would make the plugin sensitive to any changes in the default values (which should not happen very often though) and when some new args are added from times to times (see 0.16.4 and hooks), making retro compatibility difficult. I may be wrong and you may have another and better solution.

Actually, I know that we could overcome this issue by inspecting the signature of the kedro run command to get the default params, but I am not a huge fan of this solution...

I suggest that we open an issue to Kedro directly to to address the issue of run_params management first before we make any change to kedro-mlflow. Are you ok with this?

Since this issue is not a critical change, I don't think we should tackle this right now (there are many other issues more important IMHO), do you agree?

turn1a commented 3 years ago

What I mean is the following change:

diff --git a/kedro_mlflow/framework/hooks/pipeline_hook.py b/kedro_mlflow/framework/hooks/pipeline_hook.py
index 01d5522..534b80e 100644
--- a/kedro_mlflow/framework/hooks/pipeline_hook.py
+++ b/kedro_mlflow/framework/hooks/pipeline_hook.py
@@ -67,7 +67,7 @@ class MlflowPipelineHook:
             run_name=run_name,
             nested=mlflow_conf.run_opts["nested"],
         )
-        mlflow.set_tags(run_params)
+        mlflow.set_tags({k: v for k, v in run_params.items() if v})
         # add manually git sha for consistency with the journal
         # TODO : this does not take into account not committed files, so it
         # does not ensure reproducibility. Define what to do.

In this way when you run for example kedro run --pipeline train you end up with the following tags:

image

Is there any reason we couldn't do it like that? I might be missing something so please correct me if I'm wrong.

Galileo-Galilei commented 3 years ago

Ok for this! Avoiding to store None value totally makes sense. I was thinking of something more difficult with an attempt to define whether a value was the default one or not (for instance you may want to avoid logging the environment above because "local" is the default value).