Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

Reproduce the pipeline run with mlflow #205

Closed: vinaygrao4git closed this issue 3 years ago

vinaygrao4git commented 3 years ago

Hi,

First of all, thank you for providing the mlflow plugin for Kedro. It is working great and I am planning to adopt it in my current project at my organization. The mlflow UI is great, with the run ID and other details.

I was looking to rerun a pipeline from its run ID, but I could not find any command or demo in the documentation showing how to rerun an older pipeline with its git sha source version. The mlflow UI captures all the details required for a rerun, so could you please either share a sample demo or explain which command to use to rerun the pipeline with the corresponding older git sha and datasets? That would be very helpful. I am attaching a screenshot from the documentation for reference, where I could not find the command details. Appreciate your help.

Thanks Vinay

[Screenshot attachment: Kedro-MLflow-Doc]
Galileo-Galilei commented 3 years ago

Hi @vinaygrao4git, glad to see you enjoy the plugin.

Unfortunately, it is very unlikely that you will be able to rerun an arbitrary old run unless you put in a lot of effort yourself when you run it the first time. I will try to explain below why this is so hard.

How mlflow handles reproducibility

mlflow has a module called MLflow Projects, which provides a command-line tool to run a project described by a configuration file (MLProject). To draw a parallel with Kedro, you have to specify your pipeline's nodes as entry points in the MLProject file, as well as the parameters and data paths, and then run it with the [mlflow run](https://github.com/mlflow/mlflow/tree/master/examples/multistep_workflow#multistep-workflow-example) command, optionally overriding some parameters at runtime. This is not flexible at all and not well suited to quick experimentation. mlflow stores the command you launched in the run (including the parameters overridden at runtime).
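For illustration, here is a minimal sketch of what such an MLProject file looks like; the project name, entry point, parameter and training script are purely hypothetical:

```yaml
# MLProject (hypothetical sketch): every step and its parameters must be
# declared up front, then launched through the mlflow CLI.
name: example_project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.1}
    command: "python train.py --alpha {alpha}"
```

You would then launch it with something like `mlflow run . -P alpha=0.5`, and mlflow records that exact command in the run.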

Why Kedro runs are even harder to reproduce

Kedro decouples configuration from DAG creation. This means that parameters and data paths are not injected at runtime but are written in the configuration files. Kedro's Journal is very misleading here, since the "git sha" it stores is the one of the last commit, regardless of whether you have uncommitted changes. A worst-case (but very common) scenario is the following:

  1. You commit your project in a given state (e.g. with my_param: 1 in parameters.yml), giving git_sha=1234
  2. You change one parameter in the configuration file (say my_param: 2)
  3. You launch the kedro run command
  4. In the mlflow UI, the plugin stores my_param=2 and git_sha=1234

In this example, if we relaunched the run from the stored git sha, the execution would use my_param=1, which does not reproduce the logged run. One way to guard against this yourself is sketched below.
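As an illustration only (this is not something the plugin does for you), you could log the exact configuration file and the state of the git working tree while the run is active; the function name and file path below are hypothetical:

```python
# Hedged sketch: capture the pieces the Journal misses so the run can be
# reproduced later. Assumes an mlflow run is already active (e.g. started
# by kedro-mlflow's hook) and that git is available on the PATH.
import subprocess

import mlflow


def log_reproducibility_info(params_path="conf/base/parameters.yml"):
    # Record whether the working tree was dirty when the run was launched.
    dirty = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True
    ).stdout.strip()
    mlflow.set_tag("git_dirty", bool(dirty))
    # Store the exact configuration file used, not just the last commit's version.
    mlflow.log_artifact(params_path)
```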

Conclusion

I originally planned to add deeper integration with MLflow Projects, but given 1) the limitations of MLflow itself and 2) the need to modify Kedro's configuration management and Journal, I decided to move on. Even though I tried very hard to enforce such a reproducibility mechanism, I could not achieve something consistent across the different versions of Kedro and MLflow, which are both under active development.

The good news is that if you train your model through a CI/CD pipeline, you get rid of all the moving parts and always run from a git sha, so mlflow's default behaviour can apply. I plan to add a command to the plugin that generates an MLProject file wrapping your desired kedro command (see #11), which would at least make this use case easier, but you would still have to manage your data on your own; a rough sketch of what such a file could look like is shown below. Unfortunately, I don't have a lot of time right now, so I cannot give you a timeline.
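To make the idea concrete, here is a hypothetical sketch of an MLProject file wrapping the kedro CLI; the project name, environment file and default pipeline are placeholders, and the exact content of the generated file is not decided yet:

```yaml
# MLProject (hypothetical sketch): let `mlflow run` check out the project at a
# given git sha, rebuild the environment and launch the kedro pipeline.
name: my_kedro_project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      pipeline: {type: string, default: "__default__"}
    command: "kedro run --pipeline {pipeline}"
```

Running something like `mlflow run <git_repo_uri> --version <git_sha>` would then execute the pipeline from that commit, which only works if the data and configuration referenced at that commit are still available.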

The comment in the documentation (while referring to a real mlflow feature) is quite misleading and should be either explained further (with the detailed assumptions made for such reproducibility), or removed.

vinaygrao4git commented 3 years ago

Thanks @Galileo-Galilei for your detailed response. We are building our own modules for our API (built on Kedro) to handle reproducibility, so I will close this issue.