Context

As of today, kedro-mlflow offers a clear way to log parameters (through a Hook) and artifacts (through the MlflowArtifactDataSet class in the catalog.yml.

However, there is no weel-defined way to log metrics automatically in mlflow within the plugin. The user still have to log the metrics directly by calling log_metric within its self-defined function. this is not very convenient nor parametrizable, and makes the code lesss portable and messier.

Feature description

Provide a unique and weel defined way to og metrics through the plugin.

Possible Implementation

The easiest implemation would be to create a MlflowMetricDataSet very similar to MlflowArtifactDataSet to enable logging the metric directly inthe catalog.yml.

The main problem of this approach is that some metrics evolve over time, and we would like to log the metric on each update. This is not possible with this approach because the updates are made inside the node (when it is running), and not at the end.

@Galileo-Galilei did you do some work on it? I would be happy to take it over and provide a solution proposition as PR. (sorry for interrupting your holidays...)

Hello @akruszewski, thank you so much for taking this one. I have some general design decisions about MlflowDataSets in mind that I want to discuss with you.

General principles

I wish all DataSets implemented in kedro-mlflow respecs the following principles:

Consistency with `Kedro` and `mlflow`

Why should we enforce this principle

the more consistent our plugin is with Kedro and mlflow, the less documentation we need. If the DataSets save and load methods behaves in the most expected way it will enable users to refer to Kedro and mlflow documentation and to transition to the plugin smoothly.
It will also help migration easier for user who already use Kedro and mlflow together with multiple calls to mlflow logging function inside their nodes to move to the plugin.
What does it imply
Avoid side effect in the datasets method (e.g. writing to disk when it is not what the mlflow logging function does)
Keep consistency as much as possible between arguments / methods in our DataSets and with Kedro's ones. Typically, a save_csv must be consistent with the one in Kedro (exactly the same arguments, with the same names).
Eventually create several DataSets with explicit names to make clear what are the operations they performed.

Minimalism

Why should we enforce this principle

The maintenance of a plugin is time-consuming, and there is a high risk that documentation became not consistent with code. The more different cases handed by the code, the more operational risk. Avoid reinventing the wheel as much as possible.
The more side tasks perfomed by a DataSet (e.g., perfoming I/O on the disk), the more maintenance cost we will have, and it is an exponential burden because we will have to keep consistency accross our different DataSets. See the numerous PR in Kedro for DataSets to see that is is a entire system to maintain.
The more your classes contains very specific methods, the less readable and reusable they become. Once again, it increases maintenance costs.

In general, it is easier to add a functionality than to supress one. I want to be sure that extra functionality do not have an existing clear way to be done.

What does it imply

For example, your existing save_csv and save_json methods do not enable to open the file in append mode, to specify the encoding, to save in a pickle format... This will raise a lot of new PR with documentation update and consistency to ensure. How far do we want to go ?
The class becomes longer and harder to read
If the MlflowMetricDataSet only call log_metric but you absolutely need to save it on the disk too, you can either use transcoding or switch you tracking uri to a local path. In my opinion, this is really unecessary to try to manage this I/O operation by yourself since it is very easy to add it if a user want it apart

Application to this PR

In my opinion, we should implement TWO datasets and not only one (consistency with mlflow principle):

An MlflowMetricDataSet: deal with either a float, a list of float or a dict of {step: float, value: float}. The ability to specify a step for the metrics was introduced in mlflow a long time ago. It feels natural when you log metric at each iteration step (for instance because scoring your test data is slow at each epoch for a big neural network, so you may want to evaluate the metric only every X epochs). This is not a big feature but I think it is better to be consistent with mlflow.
An MlflowMetricsDataSet (with an s) : a dict of {string, one of the 3 value} : same as above, but log all the metrics in the dict (you already enable this in your current PR, I just think it is better to separate the two datasets for consistency)

For the MlflowMetricDataSet (MlflowMetricsDataSet is completely similar), it would go like this:

`_save`

if self.save_args["run_id"] is not None , call log_metric to the given run_id if it exists (with a different behaviour depending on if the value is a float, a list of float or dict of {float, float})
elif mlflow.active_run(), call log_metric to current run
else fail
register value in self._data (as for MemoryDataSet) to enable further loading without querying

`_load`

if self.load_args["run_id"] is not None, use MlflowClient() to load metric from the specified run_id
elif self._data is not None, use current in memory value (I put this condition in second because I think it is more coherent to query again if the run_id has changed in interactive mode)
else either fail (consistent with mlflow, where the metric should have been logged before it can be queried) or query the most recent value among all exsting runs? (this last one is very dangerous and will likely lead to potential unintuitive behaviour to users but is consistent with Kedro which will load the last file on your disk, no matter if it has been modified during this run or not)

Unsolved constraints

I have seen that you created a prefix attribute in order to distinguish betwen th datasets you calculated the metric on. It makes sense to me to have such a distinction, but the prefix might lead to further unattended complications:

I would really enjoy (but I can't see how to implement it easily right now) to be able to log the metric with the associated key equals to the name of the dataset in the catalog.yml file. It would introduce more consistency between the mlflow database and the catalog.yml. It would also enable to pull more easily a run in the future (by creating a catalog from a mlflow run for instance, if it makes sense).
I don't really see why the key attributes is not enough and you need the extra prefix attribute. Either you apply the same pipeline on a different datset, and I guess you record extra informations (date, number of rows, of columns, ...) on the data inside the mlflow run to distinguish easily between runs, or you calculate the same metric on two datasets inside the same run, and they correspond to different entries in your catalog so they have differents names / keys.

Do you have any idea on how we could enforce that?

Conclusion

@kaemo @akruszewski Does it make sense to you? Do you have strong evidences to go against this implementation? (Notice that it also have implications for how we should handle #12, but I will have another post tomorrow about this). I am completely open to discussion.

@Galileo-Galilei I agree with basically everything, but there's one use case that needs to be solved which your proposed solution does not address.

When there are two pipelines (say training and prediction) and the first one is producing a dataset/model/metric and the second one is executed as a separate run and depends on one of those artifacts (in a broad sense) there is no way to run such pipelines one after the other because you would need to specify run_id in load_args manually every time you run the training pipeline. This is a no-go for me because it excludes one of the most basic workflows. Pure Kedro solves this by loading the given path & latest version from the disk. With Kedro-MLflow I cannot currently run two pipelines with such dependencies one after the other in an automatic way.

I see two possible solutions:

This automatic loading of the latest serialised artifact version could be solved by saving it on disk during the first pipeline run. This is already happening in a way for artifacts (in MLflow sense) because to log them to MLflow we need to serialise them to disk first.
Search through the runs using mlflow.search_runs ordered by run date in descending fashion until we find a run with artifact with a given name/path stored in that run and load it from there. This can lead to a long search and load time if you have many experiments plus if two different entries in Data Catalog share same artifact path we got an issue.

Let me know how do you think this can be solved. Manually specifying a run_id to load from is not a solution for me. Also, what if one was given say an MLflow model from other project and needed to load it from the local path? There is no way to do that. One would need to create an artificial workflow to put it MLflow first, get run_id, specify it in Data Catalog and then load it from there.

I totally agree, and that's part of what I have in mind when I wrote

Notice that it also have implications for how we should handle #12, but I will have another post tomorrow about this

but I did not write this post fast enough ;)

Thoughts about your workflow

First of all, I think that the workflow you describe (2 separated pipeline, one for training, one for prediction) only concerns artifacts (and models, which are special artifacts), but not parameters nor metrics. I don't have any common use case when you may want to retrieve parameters/metrics in another run : params which need to be reused for prediction are always stored in an object which will be stored as an artifact, and metrics are "terminal" objects : another pipeline will likely use other data and calculate other metrics.

The point you are describing is one of the major disagreement between the data scientists and the data engineers at work (they do not use this plugin but a custom made version, it does not matter here). The point is that data scientists want to perform the operation you describe (load latest version on disk without providing run_id, reuse a model from a coworker copy/pasted locally), while data engineers want this operation (providing the run_id) to be manual and the artifacts downloaded from mlflow as the single source of truth, because when they deploy in production they want an extra check after training the model. Data engineers insist that manually providing the run_id is the responsibility of the "ops guy". They really stand against just using "the last version" to avoid operational risk.

The consensus we reached is to force providing the run id when running the pipeline as a global paramer (we use the TemplatedConfigLoader class to provide the mlflow run_id at runtime kedro run --pipeline=prediction --run-id=RUN_ID. This constrains exploration, but facilitates deployment (for exploration you don't have to modify the catalog since you can specify the id at runtime, so it is easier than modifying the catalog each time as you suggest).

I don't feel this is the right solution for us though, because the plugin will not be self contained and it will imply messing up with the project template which is a movin part. It will largely hurt the portability and ease to use of the plugin.

Suggestion for the DataSets of the plugin

For metrics: MlflowMetricDataSet and MlflowMetricsDataSet, I think above suggestions are still valid
For models:
1. We absolutely need a MlflowLocalModelDataSet (not sure about the name, but I want to distinguish it from the datasets that log in mlflow) whose save method call mlflow.save_model on the disk and whose load method loads from the disk (no logging involved here)
2. We may need a MlflowModelDataSet that performs both saving and logging, but we still have to define its load method and it will likely be redundant with other impmentation (see further)
For artifacts, we should change the current MlflowDataSet:
- rename it to `MlflowArtifactDataSet for consistency with the others
- change its save method to enable saving a folder and not only a file (because a model is a folder with the environment, metadata, artifacts)
- change its load method with following priorities: if self.load_args["run_id"] is provided load from mlflow, else load from local self.filepath, else fail.

This implementation would also enable to have the following entry in the catalog:

my_model:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    data_set:
        type: kedro_mlflow.io.MlflowLocalModelDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/LOCAL/destination/folder # must be a local folder

so it would made point 2.ii irrelevant, since it will be completely redundant with above entry.

Would it match all your needs if we do it this way ?

Hi @Galileo-Galilei, thanks for your review! I think that I implemented most of the things, but I have also few topics to cover in discussion. If I omitted something, please point it here.

I pushed today branch with the second implementation of this issue. In this one MlflowMetricsDataSet:

doesn't save anything on a filesystem,
uses load_args and save_args arguments (as for now just one load/save arg is used: run_id), I'm still thinking that there should be a possibility to pass run_id once as an argument to the constructor.
if run_id is given as load_argument, load method looking for metrics in a given run, otherwise it tries to find it in the current run (if exists), if fails, rises DataSetError,
if run_id is given as save_argument, save method uses it to save, otherwise, it try to obtain run_id from an active run, otherwise raises DataSetError (actually I will change it, it shouldn't fail, but it should create new run with log_metric, as this is the default behavior of MLflow),
It can take metrics as dict with floats, lists of floats, dict with value and step keys, or list with dicts with value and step keys. (we should consider adding timestamp key, as this is the third argument to log_metric),

What I didn't implement and why?

Treating MlflowMetricsDataSet as in-memory dataset.

As you mention in the comment to my original PR, we should limit side-effects to bare minimum. as logging to MLflow is our mine task (and side effects in terms of function purity), we probably should avoid putting it in the _data instance attribute.

The second argument would be that MlflowMetricsDataSet is not an in-memory dataset, but rather a dataset that is persistent.

Of course, if you are thinking that it is still better to have the _data instance attribute, I will add it, And I will be happy to discuss it further.

Why I left the `prefix` attribute?

I will describe a scenario where it is useful.

Let assume that we are training two models. For both of them, we have just one reusable node which evaluates models, which returns one metric: accuracy. How to distinguish them? The only way (in my opinion, I would be happy to hear about different solutions, because, to be honest, I don't like mine) is to define a prefix.

The best scenario would be of course take the dataset name from Kedro Catalog, but I didn't found a way to do that (and I believe that there is non).

@Galileo-Galilei I forgot to mention that in my opinion there is no point in doing the second dataset MlflowMetricDataset, as it would be almost identical to this one, and there is no semantic differences in log one or multiple metrics. You can utilize for that purpose the same dataset (as you are doing the same thing with CSV files, no matter if you have one or multiple rows in it).

PR: https://github.com/Galileo-Galilei/kedro-mlflow/pull/49

@kaemo @Galileo-Galilei I have also idea, let me know what do you think about it. If you would find a time, I would be happy to have a live session (chat/video chat/another live channel of communication), where we could discuss this topic.

Hello, I agree on almost everything.

Some comments:

uses load_args and save_args arguments (as for now just one load/save arg is used: run_id), I'm still thinking that there should be a possibility to pass run_id once as an argument to the constructor.

The more I think about it, the more I agree with you. My first idea was to enable the possibility to load from one run and log in another because some data scientists do this manually for some artifacts/models (as @kaemo suggested above, they share model locally during experimentation phase even if it sounds a bad practice for further productionizing). However :

it does not make sense at all for metrics, and I don't think we should support this because it hinders reproducibility.
if someone wants to do this, he can still load locally with one data set and relog in another run with another instance of the dataset
run_id have here the same role as the filepath and it would be much more consistent with Kedro to pass it to the constructor.

Conclusion: let's pass run_id to contructor

(actually I will change it, it shouldn't fail, but it should create new run with log_metric, as this is the default behavior of MLflow)

Agreed, let's keep mlflow behaviour even if I don't like it and think like you that it should rather fail. It should not have any impact while running a pipeline in comand line (because hooks properly manage run opening and closing), but it will change behaviour in interactive mode.

It can take metrics as dict with floats, lists of floats, dict with value and step keys, or list with dicts with value and step keys.

It should also handle "float" only, shouldn't it?

(we should consider adding timestamp key, as this is the third argument to log_metric)

I wish we could, but I don't think we can pass the timestamp key as an argument in log_metric unfortunately according to mlflow documentation

we should limit side-effects to bare minimum [...] we probably should avoid putting it in the _data instance attribute.

Agreed. My idea here was to avoid an extra http connexion when loading from a remote database but it is really not a big issue and avoiding side effects is more important to me.

For both of them, we have just one reusable node which evaluates models, which returns one metric: accuracy Yes, I totally understand that.

The best scenario would be of course take the dataset name from Kedro Catalog, but I didn't found a way to do that (and I believe that there is non).

I totally agree that it would be much better to retrieve the name of the DataCatalog. I think we can achieve it the following way:

remove prefix and keep a key attribute
implement an after_catalog_created method in the MlflowPipelineHook that modifies on the fly the MlflowMetricsDataSets to add the names in the catalog as the key attribute of the MlflowMetricsDataSets if the key attribute is not provided.
fail if there is no key attributes.

I think that having automatic consistency with the DataCatalog is a fair compensation for the additional complexity/side effect introduced by such an implementation.

there is no point in doing the second dataset MlflowMetricDataset, as it would be almost identical to this one, and there is no semantic differences in log one or multiple metrics.

Agreed, it will introduce too much code redundancy for very little additional gain.

P.S: The call is a very good idea. I've sent you linkedin invitation to exchange privately our coordinates.

@kaemo,

Search through the runs using mlflow.search_runs ordered by run date in descending fashion until we find a run with artifact with a given name/path stored in that run and load it from there.

I forgot to write it but using the most recent runs for loading is completely out of the possible solutions. Indeed I've learnt that some teams use a common mlflow for all data scientists (unlike my team where all data scientists have their own they can handle as they want+ a shared one for sharing models where training is triggered by CI/CD). This leads to conflicting writing issues (several runs can be launched by different data scientists at the same time. I feel that it is a very bad decision (and they complain that their mlflow is a total mess), but it is still what they use right now and we cannot exclude the possibility that even for my team the shared mlflow can have conflicts if several runs are launched concurrently (especially when models are long to train, e.g. deep learning models)

Galileo-Galilei / kedro-mlflow

Create a MlflowMetricDataSet #9

Context

Feature description

Possible Implementation

General principles

Consistency with `Kedro` and `mlflow`

Why should we enforce this principle

What does it imply

Minimalism

Why should we enforce this principle

What does it imply

Application to this PR

`_save`

`_load`

Unsolved constraints

Conclusion

Thoughts about your workflow

Suggestion for the DataSets of the plugin

What I didn't implement and why?

Treating MlflowMetricsDataSet as in-memory dataset.

Why I left the `prefix` attribute?

Galileo-Galilei / kedro-mlflow

Create a MlflowMetricDataSet #9

Context

Feature description

Possible Implementation

General principles

Consistency with Kedro and mlflow

Why should we enforce this principle

What does it imply

Minimalism

Why should we enforce this principle

What does it imply

Application to this PR

_save

_load

Unsolved constraints

Conclusion

Thoughts about your workflow

Suggestion for the DataSets of the plugin

What I didn't implement and why?

Treating MlflowMetricsDataSet as in-memory dataset.

Why I left the prefix attribute?

Consistency with `Kedro` and `mlflow`

`_save`

`_load`

Why I left the `prefix` attribute?