Confusing error message when trying to override `run_id`

astrojuanlu commented 1 month ago

Description

I read in the docs that

Consequently it will never be possible to save in a specific mlflow run_id if you launch a pipeline with the kedro run command because the MlflowHook creates a new run before each pipeline run.

...but I'm stubborn so I still wanted to try.

For that, I specified my run_id both in the configuration and in the dataset:

# catalog.yml
regressor:
  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
  flavor: mlflow.sklearn
  run_id: ${runtime_params:mlflow_run_id}

# mlflow.yml
tracking:
  run:
    id: ${runtime_params:mlflow_run_id}

And when launching kedro run --params mlflow_run_id=xxx I get the following error:

kedro.io.core.DatasetError: 'run_id' cannot be specified if there is an mlflow active run.Run_id mismatch: 
 - 'run_id'=6ffe573f95124a65a04dcece1589fbdc
 - active_run id=6ffe573f95124a65a04dcece1589fbdc

Expected Result

If the active_run and the specified run_id are the same, proceed.

Actual Result

(The error above)

Your Environment

kedro and kedro-mlflow version used (pip show kedro and pip show kedro-mlflow): 0.19.5 and 0.12.2 respectively
Python version used (python -V): 3.11.5
Operating system and version: macOS Sonoma 14.4.1

Does the bug also happen with the last version on master?

Yes, it still does.

Galileo-Galilei commented 1 month ago

This is definitely a feature, not a bug 😉 even if I acknowledge the error message could be better (but I did not imagine someone would ever do that!)

Essentially what happens is that :

you've run a first time your pipeline which has created a "regressor" in a new mlflow run 6ffe573f95124a65a04dcece1589fbdc
your run it a second time but instead of creating a new run, you log everything in the mlflow_run_id run because of
```
# mlflow.yml
tracking:
run:
id: ${runtime_params:mlflow_run_id}
```
This is very unlikely that you want to modify an existing run, but say you want to complete it
you're now trying to save a new regressor in the same run, which will corrupt the old run and definietly override the first regressor you trained. This is exactly the opposite of what you ar trying to achieve in experiment tracking: if you want to train another regressor, just create another mlflow run so you can compare the two and select the best one

However, I assume you want to load the regressor you've just trained. This is perfectly possible:

# catalog.yml
regressor:
  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
  flavor: mlflow.sklearn
  run_id: ${runtime_params:mlflow_run_id}

# note that you don't specify anything in mlflow yml regarding the run id, you will either create a new mlflow run (or don't create a run if you added it to the blacklist)

and now you can run:

kedro run --pipeline inference --params mlflow_run_id=xxx (note that this is not the training pipeline, but a one that consumes the model)

(On a side note, it is likely a bad idea to deal specify the run id manually on each run, just specify a registered_model_name and let the model registry handle it for you)

Does it make sense?

astrojuanlu commented 1 month ago

My understanding from the beginning was that this is a feature 😄 The point was more around the error message.

In any case, thanks a bunch for the explanation 🙏🏼 it makes a ton of sense.

I leave it to you whether you can to keep this issue open for improving the error message, or just close it as won't fix.

(On a side note, it is likely a bad idea to deal specify the run id manually on each run, just specify a registered_model_name and let the model registry handle it for you)

You mean https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html#mlflowmodelregistrydataset ?

Galileo-Galilei / kedro-mlflow