Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

Create a MlflowMetricDataSet #9

Closed: Galileo-Galilei closed this issue 3 years ago

Galileo-Galilei commented 4 years ago

Context

As of today, kedro-mlflow offers a clear way to log parameters (through a Hook) and artifacts (through the MlflowArtifactDataSet class in the catalog.yml).

However, there is no well-defined way to log metrics to mlflow automatically within the plugin. Users still have to log metrics directly by calling log_metric inside their own node functions. This is neither convenient nor parametrizable, and it makes the code less portable and messier.

Feature description

Provide a single, well-defined way to log metrics through the plugin.

Possible Implementation

The easiest implementation would be to create a MlflowMetricDataSet, very similar to MlflowArtifactDataSet, to enable logging the metric directly in the catalog.yml.

The main problem with this approach is that some metrics evolve over time, and we would like to log the metric on each update. This is not possible with a catalog-based dataset alone, because the updates happen inside the node (while it is running), not at the end.
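
For illustration, this is roughly what users currently have to write by hand inside a node (plain mlflow API, with a toy node invented for the sake of the example):

import mlflow

def evaluate_over_epochs(losses):
    """Toy node: `losses` is a list of per-epoch loss values computed upstream."""
    for epoch, loss in enumerate(losses):
        # the metric is updated at every epoch, while the node is still running,
        # which a dataset that only saves the node's return value cannot reproduce
        mlflow.log_metric("loss", loss, step=epoch)
    return min(losses)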

akruszewski commented 3 years ago

@Galileo-Galilei did you do some work on it? I would be happy to take it over and provide a solution proposition as a PR. (sorry for interrupting your holidays...)

Galileo-Galilei commented 3 years ago

Hello @akruszewski, thank you so much for taking this one. I have some general design decisions about MlflowDataSets in mind that I want to discuss with you.

General principles

I would like all DataSets implemented in kedro-mlflow to respect the following principles:

Consistency with Kedro and mlflow

Why should we enforce this principle

Minimalism

Why should we enforce this principle

In general, it is easier to add functionality than to suppress it. I want to be sure that any extra functionality does not already have a clear, existing way to be done.

What does it imply

Application to this PR

In my opinion, we should implement TWO datasets and not only one (consistency with mlflow principle):

For the MlflowMetricDataSet (MlflowMetricsDataSet is completely similar), it would go roughly like this (see the sketch below):

_save

_load
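
A rough sketch of the shape I have in mind (names and details are illustrative, not a final API): _save would push the value through MlflowClient.log_metric, and _load would read the latest value back with MlflowClient.get_metric_history.

import mlflow
from kedro.io import AbstractDataSet
from mlflow.tracking import MlflowClient

class MlflowMetricDataSet(AbstractDataSet):
    def __init__(self, key: str, run_id: str = None):
        self.key = key
        self.run_id = run_id
        self._client = MlflowClient()

    def _active_run_id(self) -> str:
        # assumes a run is already active (the plugin's hooks open one)
        # when no run_id is given explicitly
        return self.run_id or mlflow.active_run().info.run_id

    def _save(self, data: float) -> None:
        self._client.log_metric(self._active_run_id(), self.key, float(data))

    def _load(self) -> float:
        history = self._client.get_metric_history(self._active_run_id(), self.key)
        return history[-1].value  # latest logged value

    def _describe(self) -> dict:
        return {"key": self.key, "run_id": self.run_id}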

Unsolved constraints

I have seen that you created a prefix attribute in order to distinguish between the datasets you calculated the metric on. It makes sense to me to have such a distinction, but the prefix might lead to further unintended complications:

Do you have any idea on how we could enforce that?

Conclusion

@kaemo @akruszewski Does it make sense to you? Do you have strong evidence against this implementation? (Notice that it also has implications for how we should handle #12, but I will write another post tomorrow about this.) I am completely open to discussion.

turn1a commented 3 years ago

@Galileo-Galilei I agree with basically everything, but there's one use case that needs to be solved which your proposed solution does not address.

When there are two pipelines (say training and prediction), where the first produces a dataset/model/metric and the second is executed as a separate run and depends on one of those artifacts (in a broad sense), there is no way to run them one after the other, because you would need to specify run_id in load_args manually every time you run the training pipeline. This is a no-go for me because it excludes one of the most basic workflows. Pure Kedro solves this by loading the given path and latest version from disk. With Kedro-MLflow I currently cannot run two pipelines with such dependencies one after the other in an automated way.

I see two possible solutions:

  1. This automatic loading of the latest serialised artifact version could be solved by saving it to disk during the first pipeline run. This already happens, in a way, for artifacts (in the MLflow sense), because to log them to MLflow we need to serialise them to disk first.
  2. Search through the runs using mlflow.search_runs, ordered by run date in descending order, until we find a run containing an artifact with the given name/path, and load it from there (see the sketch below). This can lead to long search and load times if you have many experiments, and if two different entries in the Data Catalog share the same artifact path we have a problem.
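
A rough sketch of option 2 (illustrative only, hypothetical helper name): scan runs from newest to oldest and stop at the first one that contains the requested artifact path.

import mlflow
from mlflow.tracking import MlflowClient

def find_latest_run_with_artifact(artifact_path: str, experiment_ids=None) -> str:
    client = MlflowClient()
    runs = mlflow.search_runs(
        experiment_ids=experiment_ids,
        order_by=["attribute.start_time DESC"],  # most recent runs first
    )
    for run_id in runs["run_id"]:
        # only top-level artifacts are listed here; nested paths would need a recursive walk
        if artifact_path in {a.path for a in client.list_artifacts(run_id)}:
            return run_id
    raise FileNotFoundError(f"No run found containing artifact '{artifact_path}'")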

Let me know how you think this can be solved. Manually specifying a run_id to load from is not a solution for me. Also, what if one was given, say, an MLflow model from another project and needed to load it from a local path? There is no way to do that. One would need to create an artificial workflow: put it into MLflow first, get the run_id, specify it in the Data Catalog, and then load it from there.

Galileo-Galilei commented 3 years ago

I totally agree, and that's part of what I had in mind when I wrote:

Notice that it also has implications for how we should handle #12, but I will have another post tomorrow about this

but I did not write this post fast enough ;)

Thoughts about your workflow

First of all, I think that the workflow you describe (two separate pipelines, one for training, one for prediction) only concerns artifacts (and models, which are special artifacts), but not parameters or metrics. I don't have any common use case where you would want to retrieve parameters/metrics from another run: params which need to be reused for prediction are always stored in an object which will itself be stored as an artifact, and metrics are "terminal" objects: another pipeline will likely use other data and calculate other metrics.

The point you are describing is one of the major disagreements between the data scientists and the data engineers at work (they do not use this plugin but a custom-made version, but that does not matter here). Data scientists want to perform the operation you describe (load the latest version from disk without providing a run_id, reuse a model a coworker copy/pasted locally), while data engineers want this operation (providing the run_id) to be manual, with the artifacts downloaded from mlflow as the single source of truth, because when they deploy to production they want an extra check after training the model. Data engineers insist that manually providing the run_id is the responsibility of the "ops guy". They really stand against just using "the latest version", to avoid operational risk.

The consensus we reached is to force providing the run id when running the pipeline, as a global parameter (we use the TemplatedConfigLoader class to provide the mlflow run_id at runtime: kedro run --pipeline=prediction --run-id=RUN_ID). This constrains exploration but facilitates deployment (for exploration you don't have to modify the catalog since you can specify the id at runtime, so it is easier than modifying the catalog each time as you suggest).
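
For illustration only (names are mine, this is not our exact code, and it assumes a kedro version that exposes the register_config_loader registration hook), the runtime injection looks roughly like this; catalog entries can then reference ${mlflow_run_id} in their load_args instead of being edited by hand:

import os

from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

class MlflowRunIdConfigLoaderHook:
    @hook_impl
    def register_config_loader(self, conf_paths):
        # in our setup the value comes from a custom CLI option; an environment
        # variable is used here only to keep the sketch short
        return TemplatedConfigLoader(
            conf_paths,
            globals_dict={"mlflow_run_id": os.environ.get("MLFLOW_RUN_ID", "")},
        )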

I don't feel this is the right solution for us though, because the plugin would not be self-contained and it would imply messing with the project template, which is a moving part. It would largely hurt the portability and ease of use of the plugin.

Suggestion for the DataSets of the plugin

  1. For metrics: MlflowMetricDataSet and MlflowMetricsDataSet; I think the above suggestions are still valid

  2. For models:

    1. We absolutely need a MlflowLocalModelDataSet (not sure about the name, but I want to distinguish it from the datasets that log to mlflow) whose save method calls mlflow.save_model on disk and whose load method loads from disk (no logging involved here)
    2. We may need a MlflowModelDataSet that performs both saving and logging, but we still have to define its load method and it would likely be redundant with another implementation (see further)
  3. For artifacts, we should change the current MlflowDataSet:

    • rename it to MlflowArtifactDataSet for consistency with the others
    • change its save method to enable saving a folder and not only a file (because a model is a folder containing the environment, metadata, and artifacts)
    • change its load method with the following priorities: if self.load_args["run_id"] is provided, load from mlflow; else load from the local self.filepath; else fail (see the sketch after the catalog example below)

This implementation would also make the following catalog entry possible:

my_model:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    data_set:
        type: kedro_mlflow.io.MlflowLocalModelDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/LOCAL/destination/folder # must be a local folder

so it would make point 2.ii irrelevant, since it would be completely redundant with the above entry.
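
For the load priority in the last bullet above, a rough sketch of the intended resolution order (illustrative, not a final implementation):

import os

from mlflow.tracking import MlflowClient

def resolve_load_path(filepath: str, artifact_path: str, run_id: str = None) -> str:
    if run_id:
        # download the artifact (a file or a whole folder) from mlflow and return
        # the local path it was downloaded to
        return MlflowClient().download_artifacts(run_id, artifact_path)
    if os.path.exists(filepath):
        return filepath  # fall back to the local copy
    raise FileNotFoundError(
        f"No run_id provided and '{filepath}' does not exist locally"
    )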

Would it match all your needs if we do it this way?

akruszewski commented 3 years ago

Hi @Galileo-Galilei, thanks for your review! I think that I implemented most of the things, but I also have a few topics to cover in the discussion. If I omitted something, please point it out here.

I pushed a branch today with the second implementation of this issue. In this one, MlflowMetricsDataSet:

What I didn't implement and why?

Treating MlflowMetricsDataSet as an in-memory dataset.

As you mentioned in the comment on my original PR, we should limit side effects to a bare minimum. As logging to MLflow is our main task (and a side effect in terms of function purity), we should probably avoid putting the data in the _data instance attribute.

The second argument would be that MlflowMetricsDataSet is not an in-memory dataset, but rather a persistent one.

Of course, if you think it is still better to have the _data instance attribute, I will add it, and I will be happy to discuss it further.

Why I left the prefix attribute?

I will describe a scenario where it is useful.

Let's assume that we are training two models. For both of them, we have just one reusable node which evaluates models and returns one metric: accuracy. How do we distinguish them? The only way (in my opinion; I would be happy to hear about different solutions because, to be honest, I don't like mine) is to define a prefix.

The best scenario would of course be to take the dataset name from the Kedro catalog, but I didn't find a way to do that (and I believe there is none).
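
To make the scenario concrete, here is roughly how I picture it with the prefix (illustrative only; the exact constructor and save format are still under discussion):

from kedro_mlflow.io import MlflowMetricsDataSet  # as proposed in the PR

# the same reusable evaluation node returns {"accuracy": ...} for both models;
# the prefix keeps the two metrics apart in the mlflow run
metrics_model_a = MlflowMetricsDataSet(prefix="model_a")
metrics_model_b = MlflowMetricsDataSet(prefix="model_b")

metrics_model_a.save({"accuracy": 0.95})  # logged as something like "model_a.accuracy"
metrics_model_b.save({"accuracy": 0.91})  # logged as something like "model_b.accuracy"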

akruszewski commented 3 years ago

@Galileo-Galilei I forgot to mention that in my opinion there is no point in creating the second dataset MlflowMetricDataset, as it would be almost identical to this one, and there is no semantic difference between logging one metric or multiple metrics. You can use the same dataset for both purposes (just as you do with CSV files, no matter whether they contain one row or many).

akruszewski commented 3 years ago

PR: https://github.com/Galileo-Galilei/kedro-mlflow/pull/49

@kaemo @Galileo-Galilei I also have an idea; let me know what you think about it. If you can find the time, I would be happy to have a live session (chat/video chat/another live communication channel) where we could discuss this topic.

Galileo-Galilei commented 3 years ago

Hello, I agree on almost everything.

Some comments:

uses load_args and save_args arguments (as for now just one load/save arg is used: run_id), I'm still thinking that there should be a possibility to pass run_id once as an argument to the constructor.

The more I think about it, the more I agree with you. My first idea was to enable the possibility of loading from one run and logging in another, because some data scientists do this manually for some artifacts/models (as @kaemo suggested above, they share models locally during the experimentation phase even if it sounds like a bad practice for further productionizing). However:

Conclusion: let's pass run_id to the constructor.

(actually I will change it, it shouldn't fail, but it should create new run with log_metric, as this is the default behavior of MLflow)

Agreed, let's keep mlflow's behaviour even if I don't like it and think, like you, that it should rather fail. It should not have any impact while running a pipeline from the command line (because the hooks properly manage run opening and closing), but it will change the behaviour in interactive mode.

It can take metrics as dict with floats, lists of floats, dict with value and step keys, or list with dicts with value and step keys.

It should also handle a plain float, shouldn't it?
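
For illustration, a small normalisation helper (a sketch with a hypothetical name, not code from the PR) showing how all the accepted shapes, plus a bare float, could be coerced into (value, step) pairs:

from typing import Any, List, Tuple

def normalize_metric_values(data: Any) -> List[Tuple[float, int]]:
    """Coerce one metric's values into a list of (value, step) pairs."""
    if isinstance(data, (int, float)):  # bare float -> single point at step 0
        return [(float(data), 0)]
    if isinstance(data, dict):  # {"value": ..., "step": ...}
        return [(float(data["value"]), int(data.get("step", 0)))]
    if isinstance(data, list):  # list of floats or list of dicts
        return [
            (float(item["value"]), int(item.get("step", i)))
            if isinstance(item, dict)
            else (float(item), i)
            for i, item in enumerate(data)
        ]
    raise ValueError(f"Unsupported metric format: {type(data)!r}")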

(we should consider adding timestamp key, as this is the third argument to log_metric)

I wish we could, but unfortunately I don't think we can pass the timestamp key as an argument to log_metric, according to the mlflow documentation.

we should limit side-effects to bare minimum [...] we probably should avoid putting it in the _data instance attribute.

Agreed. My idea here was to avoid an extra HTTP connection when loading from a remote database, but it is really not a big issue, and avoiding side effects is more important to me.

For both of them, we have just one reusable node which evaluates models, which returns one metric: accuracy

Yes, I totally understand that.

The best scenario would of course be to take the dataset name from the Kedro catalog, but I didn't find a way to do that (and I believe there is none).

I totally agree that it would be much better to retrieve the name from the DataCatalog. I think we can achieve it the following way:

I think that having automatic consistency with the DataCatalog is a fair compensation for the additional complexity/side effects introduced by such an implementation.
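
To illustrate the idea, a rough sketch of such a hook (it relies on the catalog's private _data_sets attribute, so the final implementation may well differ):

from kedro.framework.hooks import hook_impl

class MlflowMetricsPrefixHook:
    @hook_impl
    def after_catalog_created(self, catalog):
        # copy the catalog entry name onto every metrics dataset that has no
        # explicit prefix, so metric keys stay consistent with the DataCatalog
        for name, dataset in catalog._data_sets.items():  # private attribute, illustrative only
            is_metrics = type(dataset).__name__ == "MlflowMetricsDataSet"
            if is_metrics and getattr(dataset, "prefix", None) is None:
                dataset.prefix = name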

there is no point in creating the second dataset MlflowMetricDataset, as it would be almost identical to this one, and there is no semantic difference between logging one or multiple metrics.

Agreed, it would introduce too much code redundancy for very little additional gain.

P.S.: The call is a very good idea. I've sent you a LinkedIn invitation so we can exchange contact details privately.

Galileo-Galilei commented 3 years ago

@kaemo,

Search through the runs using mlflow.search_runs, ordered by run date in descending order, until we find a run containing an artifact with the given name/path, and load it from there.

I forgot to write it, but using the most recent run for loading is completely out of the question. I've learnt that some teams use a common mlflow instance for all data scientists (unlike my team, where each data scientist has their own instance they can handle as they want, plus a shared one for sharing models whose training is triggered by CI/CD). This leads to conflicting write issues (several runs can be launched by different data scientists at the same time). I feel it is a very bad decision (and they complain that their mlflow is a total mess), but it is still what they use right now, and we cannot exclude the possibility that even for my team the shared mlflow could have conflicts if several runs are launched concurrently (especially when models take a long time to train, e.g. deep learning models).