Track the globals parameters used in the DataCatalog when using the TemplatedConfigLoader

nblumoe commented 2 years ago

Description

Allow parameters to be used in the catalog and track them to MLflow.

Context

Some data sources might be parameterised (e.g. via SQL SELECT * FROM my_data WHERE date = <DATE-PARAM>) and this should get tracked to MLflow too.

Possible Implementation

Instead of just checking for params usage on Nodes, kedro-mlflow would also need to track params being used elsewhere. Could it just track all params, independently of where they are used?

I am not sure if kedro even allows such parameterised data sources in the catalog, so this might require an upstream change in kedro first.

Galileo-Galilei commented 2 years ago

Hello @nblumoe, I have had the very same use case for a while and have been thinking about how to make this possible, but it is quite hard for several reasons:

The catalog files are parsed manually using a regex (https://github.com/quantumblacklabs/kedro/blob/20f836695c2f1e72f262d1747e47b7b7352a4aa0/kedro/config/templated_config.py#L194), with anyconfig as a backend, to replace the ${...} tags in all documents matching the patterns. I would have to use the same patterns (rather than just looking for a catalog.yml file) to handle multiple environments and the ability to split the catalog into many files and even folders. I would then have to parse these files again and work out what was modified and which DataSet was affected by such tags, because Kedro does not keep track of this information. This may become slow and add a lot of boilerplate code to the plugin, so I must be careful to avoid performance and maintenance issues.
It is possible and easy to simply log all "global" variables in mlflow using the _arg_dict attribute of the ConfigLoader. However, this may reduce the readability of the mlflow runs, because it will log all your global variables, including ones that are not used in the pipeline being run. For example, if global1 is used in pipeline1 and global2 in pipeline2, running kedro run --pipeline pipeline1 will log both global1 and global2, even though global2 is not used by that pipeline, which is very confusing.
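A minimal sketch of the "log every global" option described above. It does not touch Kedro or MLflow directly: `flatten_globals` and `log_all_globals` are hypothetical helper names, the globals are assumed to be available as a plain nested dict, and the `log_params` callback stands in for something like `mlflow.log_params`.

```python
from typing import Any, Callable, Dict


def flatten_globals(globals_dict: Dict[str, Any], prefix: str = "globals") -> Dict[str, str]:
    """Flatten nested globals into dotted keys with string values."""
    flat: Dict[str, str] = {}
    for key, value in globals_dict.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse so that {"paths": {"raw": ...}} becomes "globals.paths.raw".
            flat.update(flatten_globals(value, prefix=name))
        else:
            flat[name] = str(value)
    return flat


def log_all_globals(
    globals_dict: Dict[str, Any],
    log_params: Callable[[Dict[str, str]], None],
) -> Dict[str, str]:
    """Log every global, whether or not the current pipeline uses it."""
    flat = flatten_globals(globals_dict)
    log_params(flat)  # in a real hook this could be mlflow.log_params (assumption)
    return flat


# Example with a list standing in for the MLflow client.
example_globals = {"timestamp": "2021-10-13", "paths": {"raw": "data/01_raw"}}
logged = []
log_all_globals(example_globals, logged.append)
```

The "globals." prefix is only a suggestion to keep these entries apart from node parameters in the MLflow UI. The drawback noted above still applies: every global is logged, used or not.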

In a nutshell, I plan to address this in the future, but I have other priorities at the moment for release 0.8.0: I really want to improve model serving through the plugin, since it seems to be a more requested feature. I can't give an exact timeline, but I don't see this feature being implemented for several months.

nblumoe commented 2 years ago

Thanks for looking into this!

Does your first bullet point indicate that kedro should be able to handle params in the catalog? I haven't had any luck with this yet:

# parameters.yml
timestamp: 2021-10-13

# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${params.timestamp}

${params.timestamp} doesn't get replaced in the catalog when the actual SQL query is executed.

Galileo-Galilei commented 2 years ago

Oh sorry, I thought you were already using this Kedro feature. The object you are looking for is the TemplatedConfigLoader. Once you have declared it in your hooks.py, you can create a globals.yml in your conf/<env> folder and reference its values in the catalog:

# globals.yml <- this is what you are looking for
timestamp: 2021-10-13

# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${timestamp} # You have the right syntax
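To make the substitution concrete, here is a toy imitation of what a templating loader does with `${...}` tags. This is not Kedro's implementation: `render` and `PLACEHOLDER` are hypothetical names, the regex is a simplified assumption, and leaving unknown keys untouched is a choice made for this sketch only.

```python
import re
from typing import Any, Dict

# Simplified pattern for ${name} tags (an assumption, not Kedro's exact regex).
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render(config: Any, globals_dict: Dict[str, Any]) -> Any:
    """Recursively replace ${key} tags in strings with values from globals_dict."""
    if isinstance(config, dict):
        return {k: render(v, globals_dict) for k, v in config.items()}
    if isinstance(config, list):
        return [render(v, globals_dict) for v in config]
    if isinstance(config, str):
        # In this toy version, tags with no matching global are left as-is.
        return PLACEHOLDER.sub(
            lambda m: str(globals_dict.get(m.group(1), m.group(0))), config
        )
    return config


catalog = {"my_data": {"sql": "select * from my_table where timestamp = ${timestamp}"}}
rendered = render(catalog, {"timestamp": "2021-10-13"})
```

Here `rendered["my_data"]["sql"]` ends with `timestamp = 2021-10-13`, while a tag such as `${params.timestamp}` has no matching key in the globals and would stay unreplaced.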

The problem for mlflow tracking is that I do not want to log your entire globals.yml because it likely contains some parameters unrelated to your pipeline, so I'd like to log only the ones used in your current pipeline, but I don't know how to identify them.
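One possible way to identify them, sketched under strong assumptions: the raw (not yet templated) catalog is available as a dict, and the names of the datasets used by the current pipeline are known. `placeholders_in` and `globals_used_by` are hypothetical helpers, and the regex is a simplification rather than Kedro's own pattern.

```python
import re
from typing import Any, Dict, Iterable, Set

# Simplified pattern for ${name} tags (an assumption, not Kedro's exact regex).
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def placeholders_in(value: Any) -> Set[str]:
    """Collect every ${name} tag found anywhere inside a raw config value."""
    if isinstance(value, str):
        return set(PLACEHOLDER.findall(value))
    if isinstance(value, dict):
        children = list(value.values())
    elif isinstance(value, (list, tuple)):
        children = list(value)
    else:
        return set()
    found: Set[str] = set()
    for child in children:
        found |= placeholders_in(child)
    return found


def globals_used_by(raw_catalog: Dict[str, Any], dataset_names: Iterable[str]) -> Set[str]:
    """Return the globals referenced by the given catalog entries only."""
    used: Set[str] = set()
    for name in dataset_names:
        used |= placeholders_in(raw_catalog.get(name, {}))
    return used


raw_catalog = {
    "my_data": {"sql": "select * from my_table where timestamp = ${timestamp}"},
    "other_data": {"filepath": "${base_path}/other.csv"},
}
# Suppose the pipeline being run only touches "my_data".
used = globals_used_by(raw_catalog, ["my_data"])
```

With this input, only `timestamp` is reported for a pipeline that uses `my_data`, so `base_path` would not be logged. The sketch ignores the hard parts mentioned earlier in the thread: multiple environments, catalogs split across files, and re-parsing cost.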

Galileo-Galilei commented 2 years ago

Some good news: after some trial and error, I think I have found a way to make it work.

However, to avoid migration costs, I will only implement this feature after kedro==0.18.0 and after migrating kedro-mlflow.

Galileo-Galilei commented 1 year ago

I will implement this feature, but only after kedro moves to the OmegaConfigLoader in 0.19.

kalofolias commented 1 year ago

Hello, I have a similar feature request / use-case. I also need to track some specific parameters that are not inputs of nodes.

Problem

Currently the only way to track parameters is "automatic logging" (correct me if I'm wrong).
The only parameters tracked automatically are inputs of nodes.

Therefore:

We can't track a global, as explained in the FAQ on TemplatedConfigLoader, unless it's an input parameter of a node (which is not always the case).

Use-case

I set a dataset selector in globals.yml (so it can be overridden from the command line). I want to track which dataset is used for this experiment (note: logging it as a parameter is the only way to track a str; otherwise I would have used MlflowMetricDataSet tracking).

Current solution

I had to hack a bit:

create a dummy node with an input parameter that I want to track

Desired behaviour

It would be great if I could somehow set extra parameters (e.g. the ones set by globals that control the catalog) that are not necessarily inputs of nodes.

Example: Define a MlflowParameterDataSet ?

Alternatively: track all parameters in the catalog even if not used in a node?

Galileo-Galilei commented 1 year ago

Hi @kalofolias, sorry for the late reply. I'd be really happy to make it work, because this annoys me too.

I just have not found a way to do it properly. A MlflowParameterDataSet would not really solve the problem, because I don't see how to make it log conditionally on which pipeline is run.

Tracking all the parameters does not seem to be the right default, but maybe I should add the possibility to "opt in" to this solution in case someone really wants it, since we have no other solution for now.

Galileo-Galilei commented 11 months ago

Current state:

@nblumoe: Logging globals is not possible yet, but I am waiting for kedro's official release of the OmegaConfigLoader, and hopefully this will be achieved in 0.19. I have a blocking issue that I opened in the kedro repo: https://github.com/kedro-org/kedro/issues/2973. EDIT: Since kedro may take time to implement this, I may find a way to decorate the globals to make this possible.
@kalofolias: I will make it possible to log input artifacts with #446. You will be able to log a yaml file with a single str inside as a workaround. I will also add documentation on adding custom logging logic with hooks as part of #442. Hopefully this will be cleaner than the current workaround.

Galileo-Galilei / kedro-mlflow