JaimeArboleda opened 3 months ago
Hi,

First of all, sorry for the long delay with no news.

This is the intended behaviour: the `run_id` argument is meant to specify in which run you want to load / save a model or an artifact (here, your `df_train`) from / to. However, I understand that in a typical ML workflow, you often want to "read" from a specific mlflow run, but not necessarily write to it.
I guess the best workaround is to use kedro environments to modify the `run_id` depending on the environment, something like:
```yaml
# conf/base/catalog.yml
df_train:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.ParquetDataset
    filepath: /home/f0099337/project/df_train.pq
  run_id: ${globals: df_train_run_id, null}  # if no df_train_run_id is specified in the environment, default to None and log in the active run
```
and
```yaml
# conf/training/globals.yml  (notice the /training folder, which is a newly created environment)
df_train_run_id: <29309b32d57b49dbb89d51254e055960>
```
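The fallback in the catalog entry relies on the `${globals: ..., null}` resolver. Its intended semantics can be emulated in plain Python for illustration (the real resolution is done by Kedro's `OmegaConfigLoader`; the function name here is hypothetical):

```python
# Emulated semantics of "${globals: df_train_run_id, null}" in the catalog:
# look the key up in the environment's globals, fall back to None otherwise.

def resolve_run_id(globals_conf: dict, key: str = "df_train_run_id", default=None):
    # base environment: key absent -> None -> kedro-mlflow logs to the active run
    # training environment: conf/training/globals.yml defines the key -> pinned run
    return globals_conf.get(key, default)
```

With an empty globals mapping (the base environment) this yields `None`, while an environment whose `globals.yml` defines `df_train_run_id` yields the pinned run id.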
Now you can:

- `kedro run -p preprocessing` to run the preprocessing pipeline and log in a new run id
- `kedro run -p training -e training` to run the training pipeline and use the `df_train_run_id` specified in the training environment

It would make sense in kedro-mlflow to be able to specify a different `run_id` for load and save:
```yaml
df_train:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.ParquetDataset
    filepath: /home/f0099337/project/df_train.pq
  load_args:
    run_id: ...
  save_args:
    run_id: ...
```
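The proposed behaviour can be sketched in plain Python (hypothetical class and method names, not the current kedro-mlflow API): loads come from a pinned run when one is configured, while saves default to whatever run is active:

```python
# Hypothetical sketch of separate load/save run_ids (illustrative names only;
# this is NOT the existing MlflowArtifactDataset implementation).

class RunAwareArtifactDataset:
    def __init__(self, load_run_id=None, save_run_id=None):
        self._load_run_id = load_run_id  # pinned run to read from, or None
        self._save_run_id = save_run_id  # pinned run to write to, or None

    def resolve_load_run(self, active_run_id):
        # reads come from the pinned run when one is configured
        return self._load_run_id or active_run_id

    def resolve_save_run(self, active_run_id):
        # writes default to the active run, so past runs are never overwritten
        return self._save_run_id or active_run_id
```

With `load_run_id` set and `save_run_id` left as `None`, the training pipeline would read the pinned artifact, while a new preprocessing run would log into its own freshly created run.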
Hello!
First of all, thanks for open sourcing your plugin, which is very well documented and a great addition to the MLOps ecosystem. We work for a medium-sized organization and we are building our MLOps toolset around kedro, mlflow and your plugin.

There is a problem that we are facing now and we are unsure about how to solve it. Maybe you have thought about it and there is a clean solution within your plugin, but we don't see it, so we are trying to add some extra functionality. I want to expose the problem because it looks like something that should be common, and maybe there is a better way to deal with it.
Typically we have the following pipelines: `preprocessing` and `train`.
Of course, the output of `preprocessing` is the input of `train`. They are different pipelines and in fact, in many of our projects, even the runtime environment is different (`preprocessing` uses Spark and `train` uses pandas/numpy/xgboost and other Python libraries that use in-memory computation). But in many projects we have several versions of `preprocessing` (because we might have different ways of cleaning the data, we might discard or not some particular data source, and so on). We connect the two pipelines using `run_id`.

So let's say that we are happy with a particular execution of `preprocessing`. Then, in our catalog, we will add the `run_id` to specify that our training dataset is the one generated by that particular run.

Now, the problem is that the class `kedro_mlflow.io.artifacts.MlflowArtifactDataset` overwrites the path with this specific `run_id` if you try to execute the `preprocessing` pipeline. And this is not what we want: we would like to have the possibility of running the `preprocessing` pipeline again (and saving the result in the new `run_id` generated by `mlflow`), but what happens is that if you run `preprocessing`, the result overwrites the output of the `run_id` specified in the catalog, therefore "altering the history". We would like this `run_id` specified in the catalog to affect only the version of `df_train` that is read when executing the `train` pipeline, but not the one that is written when running `preprocessing`.
For a minimal example, let me add a dumb `preprocessing` and `train` pipeline:

`preprocessing` pipeline:

`train` pipeline:

As I said, if I try to execute the `train` pipeline, it will correctly take the `df_train` corresponding to the `run_id` that I specified in the catalog. However, if I execute another run of the `preprocessing` pipeline, for example with different parameters, instead of storing the result in a new `run_id`, it will overwrite the `run_id` specified in the catalog.

Sorry for the long message, but I tried to make the issue as clear as possible.
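For readers without the original snippets, a pair of dumb node functions in the spirit of the example might look like this (hypothetical names and logic, not the author's original code):

```python
# Hypothetical stand-ins for the two dumb pipelines (illustrative only).

def preprocess(raw_rows):
    """'preprocessing' node: drop rows that contain missing values."""
    return [row for row in raw_rows if all(v is not None for v in row.values())]

def train_model(df_train):
    """'train' node: a trivial 'model' that just averages a feature."""
    xs = [row["x"] for row in df_train]
    return {"mean_x": sum(xs) / len(xs)}
```

The key point is only the dependency: `train_model` consumes the `df_train` artifact that `preprocess` produced, which is exactly the dataset the catalog pins to a `run_id`.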
Thanks in advance!