Open nblumoe opened 2 years ago
Hello @nblumoe, I have had the very same use case for a while and I have been thinking about how to make this possible, but it is quite hard for several reasons:

- I would need to use `anyconfig` as a backend to replace the tags `${...}` in all documents matching the patterns. I must use the same patterns (and not just look for a `catalog.yml` file) to deal with the multiple environments and the ability to split the catalog into many files and even folders. I have to parse these files again and find out what has been modified and which `DataSet` was concerned by such tags, because Kedro does not keep track of this information. This may become slow and add a lot of boilerplate code to the plugin, so I must be careful to avoid performance / maintenance issues.
- Alternatively, I could log the `_arg_dict` attribute of the `ConfigLoader`. This may reduce the readability of the mlflow runs because it will log all your global variables, potentially including ones that are not even used in your pipeline (e.g. if you have `global1` used in `pipeline1` and `global2` used in `pipeline2`, running `kedro run --pipeline=pipeline1` will log both `global1` and `global2` in your run, while `global2` is not even used in your pipeline, which is very confusing).

In a nutshell, I plan to address this in the future, but I have other priorities at the moment for release 0.8.0: I really want to improve model serving through the plugin, since it seems to be a more demanded feature. I can't give an exact timeline, but I don't see this feature being implemented for several months.
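The tag-discovery problem described in the first bullet could, in principle, be tackled with a pass over each parsed catalog entry. A rough sketch of the idea (simplified and hypothetical, not the plugin's actual code; real catalogs span multiple files and environments):

```python
import re

# Matches Kedro-style interpolation tags such as ${timestamp} or ${params.timestamp}
TAG_PATTERN = re.compile(r"\$\{([^}]+)\}")

def find_tags_per_dataset(raw_catalog: dict) -> dict:
    """Return {dataset_name: set_of_tag_names} by scanning every string
    value of each (already parsed) catalog entry. Hypothetical helper:
    real entries can nest arbitrarily, so we walk them iteratively."""
    tags = {}
    for name, entry in raw_catalog.items():
        found = set()
        stack = [entry]
        while stack:
            value = stack.pop()
            if isinstance(value, dict):
                stack.extend(value.values())
            elif isinstance(value, list):
                stack.extend(value)
            elif isinstance(value, str):
                found.update(TAG_PATTERN.findall(value))
        if found:
            tags[name] = found
    return tags

catalog = {
    "my_data": {
        "type": "SQLQueryDataSet",
        "sql": "select * from my_table where timestamp = ${timestamp}",
    }
}
print(find_tags_per_dataset(catalog))  # {'my_data': {'timestamp'}}
```

This only covers the discovery step; the performance concern above is about having to repeat it for every file matching the config patterns.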
Thanks for looking into this!
Does your first bullet point indicate that Kedro should be able to handle params in the catalog? I haven't had any luck with this yet:
```yaml
# parameters.yml
timestamp: 2021-10-13
```

```yaml
# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${params.timestamp}
```

`${params.timestamp}` doesn't get replaced in the catalog when the actual SQL query is executed.
Oh sorry, I thought you were already using this feature from Kedro. The object you are looking for is the `TemplatedConfigLoader`. Once you have declared it in your `hooks.py`, you can create a `globals.yml` in your `conf/<env>` folder:

```yaml
# globals.yml <- this is what you are looking for
timestamp: 2021-10-13
```

```yaml
# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${timestamp} # You have the right syntax
```
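Under the hood, the idea is plain string interpolation: each `${...}` tag in the config is replaced with the matching value from `globals.yml`. A minimal illustration of the mechanism (illustrative only, not Kedro's actual implementation):

```python
import re

def resolve_tags(text: str, global_values: dict) -> str:
    """Replace each ${key} in `text` with the matching value from
    `global_values`. Illustrative sketch only: TemplatedConfigLoader
    applies this kind of substitution across every config file it loads."""
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda match: str(global_values[match.group(1)]),
        text,
    )

globals_yml = {"timestamp": "2021-10-13"}
sql = "select * from my_table where timestamp = ${timestamp}"
print(resolve_tags(sql, globals_yml))
# select * from my_table where timestamp = 2021-10-13
```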
The problem for mlflow tracking is that I do not want to log your entire `globals.yml`, because it likely contains some parameters unrelated to your pipeline. I'd like to log only the ones used in your current pipeline, but I don't know how to identify them.
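One possible heuristic for that identification step (hypothetical, not something the plugin implements): keep only the globals whose tags appear in catalog entries that the current pipeline actually reads or writes.

```python
def globals_used_in_run(global_values: dict,
                        tags_per_dataset: dict,
                        pipeline_datasets: set) -> dict:
    """Filter globals down to the ones referenced by the datasets of the
    current run. Hypothetical helper: `tags_per_dataset` maps each catalog
    entry to the ${...} tag names found in it, and `pipeline_datasets` is
    the set of dataset names the pipeline touches."""
    used_tags = set()
    for name in pipeline_datasets:
        used_tags.update(tags_per_dataset.get(name, set()))
    return {k: v for k, v in global_values.items() if k in used_tags}

globals_yml = {"timestamp": "2021-10-13", "other_global": "unused"}
tags = {"my_data": {"timestamp"}}
print(globals_used_in_run(globals_yml, tags, {"my_data"}))
# {'timestamp': '2021-10-13'}
```

This avoids the confusion mentioned earlier (logging `global2` on a run of `pipeline1`), at the cost of having to build the tag map in the first place.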
Some good news: after some trial and error, I think I have found a way to make it work.
However, to avoid migration costs, I will only implement this feature after `kedro==0.18.0` and after migrating kedro-mlflow.
I will implement this feature, but only after Kedro moves to the `OmegaConfigLoader` in 0.19.
Hello, I have a similar feature request / use case. I also need to track some specific parameters that are not inputs of nodes. Therefore:

- We can't track a global as explained in the FAQ on the `TemplatedConfigLoader` unless it's an input parameter of a node (which is not always the case).
- I set a dataset selector in `globals.yml` (so it can be overridden from the command line). I want to track which dataset is used for this experiment (note: this is the only way to track a `str`, otherwise I would have used `MlflowMetricDataSet` tracking).
I had to hack a bit:
It would be great if I could somehow set extra parameters (e.g. the ones set by globals that control the catalog) that are not necessarily inputs of nodes.
Example: define a `MlflowParameterDataSet`?
Alternatively: track all parameters in the catalog even if they are not used in a node?
Hi @kalofolias, sorry for the late reply. I'd be really happy to make it work, because this annoys me too.
I just did not find a way to do it properly. A `MlflowParameterDataSet` will not really solve the problem, because I don't see how we can make it log conditionally on which pipeline is run.
Tracking all the parameters does not seem to be the right default, but maybe I should add the possibility to "opt in" to this solution in case someone really wants it, since we have no other solution for now.
Current state:

- Kedro does not yet expose the interpolated `globals`, which would be needed to make this possible.
- A `str` can be logged inside as a workaround. I will also add documentation on adding custom logic for logging with hooks as part of #442. Hopefully this will be cleaner than the current workaround.
Description
Allow parameters to be used in the catalog and track them to MLflow.

Context
Some data sources might be parameterised, e.g. via SQL:

```sql
SELECT * FROM my_data WHERE date = <DATE-PARAM>
```

and this should get tracked to MLflow too.

Possible Implementation
Instead of just checking for params usage on nodes, kedro-mlflow would also need to track params being used elsewhere. Could it just track all params, independently of where they are used?
I am not sure if Kedro even allows such parameterised data sources in the catalog, so this might require an upstream change in Kedro first.
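The "track all params" option above amounts to flattening the full parameters mapping into the flat key/value shape MLflow expects for params. A minimal sketch (hypothetical helper, not part of kedro-mlflow):

```python
def flatten_params(params: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Recursively flatten a nested parameters dict into dotted keys,
    the flat shape typically handed to mlflow.log_params.
    Hypothetical helper for illustration."""
    items = {}
    for key, value in params.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten_params(value, new_key, sep))
        else:
            items[new_key] = value
    return items

params = {"model": {"lr": 0.01, "depth": 3}, "timestamp": "2021-10-13"}
print(flatten_params(params))
# {'model.lr': 0.01, 'model.depth': 3, 'timestamp': '2021-10-13'}
```

The open question raised in the thread remains: logging everything this way includes params not used by the current pipeline, which is exactly the readability concern discussed above.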