Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

Track the globals parameters used in the DataCatalog when using the TemplatedConfigLoader #253

Open nblumoe opened 2 years ago

nblumoe commented 2 years ago

Description

Allow parameters to be used in the catalog and track them to MLflow.

Context

Some data sources might be parameterised (e.g. via SQL SELECT * FROM my_data WHERE date = <DATE-PARAM>) and this should get tracked to MLflow too.

Possible Implementation

Instead of just checking for params usage on Nodes, kedro-mlflow would also need to track params being used elsewhere. Could it just track all params, independently of where they are used?

I am not sure if kedro even allows such parameterised data sources in the catalog, so this might require an upstream change in kedro first.

Galileo-Galilei commented 2 years ago

Hello @nblumoe, I have had the very same use case for a while and have been thinking about how to make this possible, but it is quite hard for several reasons:

In a nutshell, I plan to address this in the future, but I have other priorities at the moment for release 0.8.0: I really want to improve model serving through the plugin, since it seems to be a more requested feature. I can't give an exact timeline, but I don't see this feature being implemented for several months.

nblumoe commented 2 years ago

Thanks for looking into this!

Does your first bullet point indicate that kedro should be able to handle params in the catalog? I haven't had any luck with this yet:

# parameters.yml
timestamp: 2021-10-13

# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${params.timestamp}

${params.timestamp} doesn't get replaced in the catalog when the actual SQL query is executed.

Galileo-Galilei commented 2 years ago

Oh sorry, I thought you were already using this feature from Kedro. The object you are looking for is the TemplatedConfigLoader. Once you have declared it in your hooks.py, you can create a globals.yml in your conf/<env> folder and reference its values in the catalog:

# globals.yml <- this is what you are looking for
timestamp: 2021-10-13
# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${timestamp} # You have the right syntax

The problem for mlflow tracking is that I do not want to log your entire globals.yml because it likely contains some parameters unrelated to your pipeline, so I'd like to log only the ones used in your current pipeline, but I don't know how to identify them.
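
One possible way to narrow the globals down is to scan the raw (not yet templated) catalog entries for `${...}` placeholders and keep only the globals they mention. The snippet below is a minimal sketch of that idea in plain Python, not kedro-mlflow's implementation: `find_placeholders` and `used_globals` are hypothetical helpers, the dictionaries stand in for the parsed catalog.yml and globals.yml, and it assumes you can obtain both the raw catalog and the names of the datasets used by the pipeline being run. Only top-level global names are handled.

```python
import re
from typing import Any, Dict, Iterable, Set

# Matches the name inside ${name} (stops at "}" or at a "|" default separator).
PLACEHOLDER = re.compile(r"\$\{([^}|]+)")


def find_placeholders(config: Any) -> Set[str]:
    """Recursively collect placeholder names from a raw config structure."""
    found: Set[str] = set()
    if isinstance(config, dict):
        for key, value in config.items():
            found |= find_placeholders(key)
            found |= find_placeholders(value)
    elif isinstance(config, (list, tuple)):
        for item in config:
            found |= find_placeholders(item)
    elif isinstance(config, str):
        found |= {name.strip() for name in PLACEHOLDER.findall(config)}
    return found


def used_globals(
    raw_catalog: Dict[str, Any],
    globals_dict: Dict[str, Any],
    dataset_names: Iterable[str],
) -> Dict[str, Any]:
    """Keep only the globals referenced by the given catalog entries."""
    names: Set[str] = set()
    for dataset_name in dataset_names:
        names |= find_placeholders(raw_catalog.get(dataset_name, {}))
    return {name: globals_dict[name] for name in sorted(names) if name in globals_dict}


# Hypothetical stand-ins for the parsed catalog.yml and globals.yml.
raw_catalog = {
    "my_data": {"sql": "select * from my_table where timestamp = ${timestamp}"},
    "other_data": {"filepath": "${base_path}/other.csv"},
}
globals_dict = {"timestamp": "2021-10-13", "base_path": "data", "unused": 1}

# Only datasets that belong to the pipeline being run are inspected.
selected = used_globals(raw_catalog, globals_dict, ["my_data"])
print(selected)
```

The remaining difficulty is the one described above: the plugin would need access to the catalog before templating and to the list of datasets of the current pipeline, which this sketch simply takes as inputs.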

Galileo-Galilei commented 2 years ago

Some good news: after some trial and error, I think I have found a way to make it work.

However, to avoid migration costs, I will only implement this feature after kedro==0.18.0 and after migrating kedro-mlflow.

Galileo-Galilei commented 1 year ago

I will implement this feature, but only after kedro moves to OmegaConfigLoader in 0.19.

kalofolias commented 1 year ago

Hello, I have a similar feature request / use-case. I also need to track some specific parameters that are not inputs of nodes.

Problem

  1. Currently the only way to track parameters is "automatic logging" (correct me if I'm wrong).
  2. The only parameters tracked automatically are inputs of nodes.

Therefore:

We can't track a global as explained in FAQ re TemplatedConfigLoader unless it's an input parameter of a node (which is not always the case)

Use-case

I set a dataset selector in globals.yml (so it can be overridden from the command line). I want to track which dataset is used for this experiment (note: this is the only way to track a str; otherwise I would have used MlflowMetricDataSet tracking).

Current solution

I had to hack a bit:

Desired behaviour

It would be great if I could somehow set extra parameters (e.g. the ones set by globals that control the catalog) that are not necessarily inputs of nodes.

Example: Define a MlflowParameterDataSet ?

Alternatively: track all parameters in the catalog even if not used in a node?

Galileo-Galilei commented 1 year ago

Hi @kalofolias, sorry for the late reply. I'd be really happy to make it work, because this annoys me too.

I just did not find a way to do it properly. A MlflowParameterDataSet will not really solve the problem because I don't see how we can make it log conditionally on the pipeline which is run.

Tracking all the parameters does not seem to be the right default, but maybe I should add the possibility to "opt in" to this solution in case someone really wants it, since we have no other solution for now.
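
To make the "opt in" idea concrete, here is a minimal sketch of what logging every global could look like on the user side. It is an illustration under assumptions, not an existing kedro-mlflow option: `flatten_params` is a hypothetical helper, and `all_globals` stands in for the parsed globals.yml. Nested entries are flattened into dot-separated keys so that each value becomes one MLflow parameter.

```python
from typing import Any, Dict


def flatten_params(params: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten a nested dict into a single level with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in params.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the parent key as prefix.
            flat.update(flatten_params(value, prefix=name, sep=sep))
        else:
            flat[name] = value
    return flat


# Hypothetical content of globals.yml after parsing.
all_globals = {
    "timestamp": "2021-10-13",
    "dataset": {"name": "train_v2", "version": 3},
}

flat = flatten_params(all_globals)
print(flat)

# Inside an active MLflow run, the flattened dict could then be logged with:
#     mlflow.log_params(flat)
```

The trade-off is the one mentioned above: this logs globals unrelated to the pipeline being run, which is why it would make sense only as an explicit opt-in.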

Galileo-Galilei commented 11 months ago

Current state: