vinaygrao4git closed this issue 3 years ago
Hi @vinaygrao4git, glad to see you enjoy the plugin.
Unfortunately, it is very unlikely that you will be able to rerun an arbitrary old run unless you put in a lot of effort yourself when running it the first time. I will try to explain below why this is so hard.
`mlflow` has a module called Mlflow Project, which provides a command line tool to run the project and a configuration file (MLProject). To draw a parallel with Kedro, you have to specify your pipeline's nodes in MLProject, as well as the parameters / data paths, and then run it with the [mlflow run](https://github.com/mlflow/mlflow/tree/master/examples/multistep_workflow#multistep-workflow-example) command, optionally overriding some parameters at runtime. This is not flexible at all and not well suited to experimenting quickly. `mlflow` stores the command you launch in the run (including the parameters overridden at runtime).
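To make the "command plus runtime overrides" idea concrete, here is a minimal sketch that assembles an `mlflow run` command line. The helper name `build_mlflow_run_command` and the project URI are hypothetical and only serve as illustration; the flags used (`-e`, `-P`, `--version`) are the ones the `mlflow run` CLI accepts, and `--version` only makes sense for Git-based projects.

```python
import shlex
from typing import List, Mapping, Optional


def build_mlflow_run_command(
    project_uri: str,
    entry_point: str = "main",
    params: Optional[Mapping[str, object]] = None,
    version: Optional[str] = None,
) -> List[str]:
    """Build the argv for an `mlflow run` call (hypothetical helper)."""
    cmd = ["mlflow", "run", project_uri, "-e", entry_point]
    if version is not None:
        # For Git-based projects, --version pins a commit hash or branch.
        cmd += ["--version", version]
    for key, value in (params or {}).items():
        # Each -P flag overrides one parameter declared in the project file.
        cmd += ["-P", f"{key}={value}"]
    return cmd


# Hypothetical project URI and values, for illustration only.
cmd = build_mlflow_run_command(
    "https://example.com/my-project.git",
    params={"my_param": 2},
    version="1234",
)
print(shlex.join(cmd))
```

Because every override appears on the command line, storing the command together with the commit is enough to describe the run. That is the assumption that does not carry over to Kedro.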
Kedro decouples configuration from DAG creation. This means that parameters / data paths are not injected at runtime but are written in the config files. Kedro's Journal is very misleading since the "git sha" it stores is that of the last commit, regardless of whether you have uncommitted changes. A worst-case (but very common) scenario is the following:
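1. You commit your code (with `my_param: 1` in `parameters.yml`) with `git_sha=1234`.
2. You change the parameter locally without committing (`my_param: 2`).
3. You launch the `kedro run` command.
4. The run is recorded with `my_param=2` and `git_sha=1234`.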
In this example, if we relaunched the run from the git sha, the execution would use `my_param=1`, which is not correct.
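One way to at least detect this situation is to record whether the working tree was dirty when the run started. The sketch below is not part of the plugin or of Kedro; the helper names `changed_paths` and `git_state` are hypothetical, and it assumes `git` is on the PATH and that the project lives in a Git repository.

```python
import subprocess
from typing import List


def changed_paths(porcelain_output: str) -> List[str]:
    """Parse `git status --porcelain` (v1) output into changed paths."""
    paths = []
    for line in porcelain_output.splitlines():
        if line.strip():
            # Lines look like "XY path": two status characters, a space,
            # then the path (renames appear as "old -> new").
            paths.append(line[3:])
    return paths


def git_state(repo_dir: str = ".") -> dict:
    """Return the current commit and whether uncommitted changes exist."""
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    dirty = changed_paths(status)
    return {"git_sha": sha, "is_dirty": bool(dirty), "changed_paths": dirty}


# Parsing a sample status output that mirrors the scenario above.
print(changed_paths(" M conf/base/parameters.yml\n"))
```

The resulting dictionary could then be attached to the run, for instance as tags via `mlflow.set_tag`, so that a run with `is_dirty=True` is never mistaken for one that is reproducible from its git sha alone.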
I originally planned to add more integration with Mlflow Project, but given 1) the initial limitations of Mlflow itself and 2) the need to modify Kedro's configuration management and Journal, I decided to move on. Even though I tried very hard to enforce such a reproducibility mechanism, I could not achieve something that is consistent across the different versions of Kedro and Mlflow, which are both in active development.
The good news is that if you train your model through a CI/CD pipeline, you get rid of all the moving parts and run from a git sha, so mlflow's default behaviour can apply. I plan to add a command to the plugin that generates an MLProject file with your desired kedro command (see #11), which would at least make this use case easier, but you would still have to manage your data on your own. Unfortunately, I don't have a lot of time right now, so I cannot give you a timeline.
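As a rough idea of what such a generated file could contain, here is a minimal sketch. It is not the command planned in #11; `render_mlproject` and `write_mlproject` are hypothetical helpers. It assumes the minimal project-file layout from MLflow's documentation (which spells the file name `MLproject`), with a single `main` entry point and no environment section.

```python
from pathlib import Path


def render_mlproject(project_name: str, kedro_command: str = "kedro run") -> str:
    """Render a minimal MLflow project file wrapping a kedro command.

    Assumes the command contains no double quotes.
    """
    return (
        f"name: {project_name}\n"
        "entry_points:\n"
        "  main:\n"
        f'    command: "{kedro_command}"\n'
    )


def write_mlproject(project_root: str, project_name: str,
                    kedro_command: str = "kedro run") -> Path:
    """Write the rendered file at the root of the project."""
    target = Path(project_root) / "MLproject"
    target.write_text(render_mlproject(project_name, kedro_command), encoding="utf-8")
    return target


# Hypothetical project name and pipeline, for illustration only.
print(render_mlproject("my_kedro_project", "kedro run --pipeline training"))
```

In a CI/CD job that checks out a clean commit, running the project through `mlflow run` would then tie the recorded git sha to the exact configuration that was executed. Data versioning remains out of scope of this sketch.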
The comment in the documentation (while referring to a real mlflow feature) is quite misleading and should be either explained further (with the detailed assumptions made for such reproducibility), or removed.
Thanks @Galileo-Galilei for your detailed response. We are building our own modules for the API (built on Kedro) to handle reproducibility. So I will close this issue.
Hi,
First of all, thank you for the mlflow plugin for Kedro. It is working great and I am planning to adopt it in the current project in my organization. The mlflow UI is great, with the run ID and other details.
I was looking to rerun a pipeline using its run ID, but I could not find any command or demo in the documentation showing how to rerun an older pipeline with its git sha source version. The mlflow UI captures all the details required for a rerun, so it would be very helpful if you could either share a sample demo or explain which command to use to rerun the pipeline with the corresponding older git sha and datasets. I am attaching a screenshot from the documentation for reference, where I could not find the command details. I appreciate your help.
Thanks Vinay