Create a MlflowModelDataSet

The current version of kedro-mlflow only enable to log a a full pipeline as a mlflow model through the KedroPipelineModel class. It would be useful to create an AbstractDataSet class in order to enable model logging within the `catalog.yml.

Hi @Galileo-Galilei,

I'd like to work on this issue. I'm thinking that a Data Catalog entry could look like this:

my_model:
    type: kedro_mlflow.io.MlflowModel
    flavor: mlflow.<flavor>
    path: data/06_models/my_model
    save_model_args:
        ... 
    load_model_args:
        ...
    log_model_args:
       ...

What do you think?

The first argument to save/log is usually a model and the second is a path so those would be populated automatically by MlflowModel class.

The only tricky parts would be a tensorflow flavor which has a few more required arguments and a custom pyfunc model. I need to work on concrete examples of both to be able to come up with a solution to these challenges.

Hello @kaemo, nice to see you back there!

I totally agree, this feature is really missing to the package and I was going to address it but please go ahead.

Some remarks / questions:

I would name it MlflowModelDataSet for consistency with the kedro API.
Why do we need flavor: mlflow.<flavor>? flavor: <flavor> seems sufficient but I may miss a point here.
We should define precisely what is the purpose of this class. You need to supercharge the save and load methods of the Abstract DataSet class. These are the one which will be called at the beginning/end of the nodes. You can define both save_model and log_model arguments, as well as an extra log_model method to the AbstractDataSet but I think this will end up to be confusing for the end user because only the save method will be called during Pipeline execution. So you MUST define the default behaviour of your save method which is what you class is intended to (that is, pick either log_model or save_model but not both).

My guess is that end user expect that MlflowModelDataSet.save calls mlflow.log_model rather mlflow.save_model. save_model is more a an helper for extra control if you want to log your model locally (and ignoring you mlflow server configuration). Most end user want to call mlflow.log_model because it is the method dedicated for logging your model to your mlflow configuration. Therefore I suggest to:

keep to Kedro load_args and save_argsnomencalure for consistency
the load_args are passed to the MlflowModelDataSet.load method
the save_args are passed to the MlflowModelDataSet.load method which calls mlflow.log_model(**save_args) internally

Your catalog entry would look like this:

my_model:
    type: kedro_mlflow.io.MlflowModelDataSet # add DataSet for consistency
    flavor: <flavor> # remove "mlflow."
    path: data/06_models/my_model
    load_args:  # arguments to be passed to mlflow.<flavor>.load_model()
        ... 
    save_args: # arguments to be passed to mlflow.<flavor>.log_model()
        ...
# remove the arguments to be passed to save model

What do you think?

Hello @kaemo, nice to see you back there!

Happy to have more time to work on kedro-mlflow!

I totally agree, this feature is really missing to the package and I was going to address it but please go ahead.

Some remarks / questions:

I would name it MlflowModelDataSet for consistency with the kedro API.

Agreed, I'll name it MlflowModelDataSet.

Why do we need flavor: mlflow.<flavor>? flavor: <flavor> seems sufficient but I may miss a point here.

I thought about this parameter as a Python module that provides load/log functions (built-in MLflow or custom flavor). So it could be mlflow.pytorch module which has its functions to load/log PyTorch models or it could be my_Project.mlflow.pytorch_lightning in my project which specifies custom flavor with load/log functions. In such a way, there would be no confusion whether it's a name of an MLflow flavor or a Python module path to a custom MLflow flavor and it would be coherent. Let me know what you think.

We should define precisely what is the purpose of this class. You need to supercharge the save and load methods of the Abstract DataSet class. These are the one which will be called at the beginning/end of the nodes. You can define both save_model and log_model arguments, as well as an extra log_model method to the AbstractDataSet but I think this will end up to be confusing for the end user because only the save method will be called during Pipeline execution. So you MUST define the default behaviour of your save method which is what you class is intended to (that is, pick either log_model or save_model but not both).

My guess is that end user expect that MlflowModelDataSet.save calls mlflow.log_model rather mlflow.save_model. save_model is more a an helper for extra control if you want to log your model locally (and ignoring you mlflow server configuration). Most end user want to call mlflow.log_model because it is the method dedicated for logging your model to your mlflow configuration. Therefore I suggest to:

keep to Kedro load_args and save_argsnomencalure for consistency

the load_args are passed to the MlflowModelDataSet.load method

the save_args are passed to the MlflowModelDataSet.load method which calls mlflow.log_model(**save_args) internally

Your catalog entry would look like this:
my_model:
    type: kedro_mlflow.io.MlflowModelDataSet # add DataSet for consistency
    flavor: <flavor> # remove "mlflow."
    path: data/06_models/my_model
    load_args:  # arguments to be passed to mlflow.<flavor>.load_model()
        ... 
    save_args: # arguments to be passed to mlflow.<flavor>.log_model()
        ...
# remove the arguments to be passed to save model
What do you think?

I agree, let's go with log_model. If someone wants to just save a model, then one of the Kedro built-in/custom DataSets should be used for serialization/deserialization.

@Galileo-Galilei, there's one thing that is problematic with the above approach. After log_model we are not able to load_model unless we specify model_uri which we don't have. Let me explain using an example.

Suppose we have a node train_lgbm that outputs a model for which we have the following entry in Data Catalog:

lgbm:
    type: kedro_mlflow.io.MlflowModelDataSet
    flavor: mlflow.lightgbm
    path: data/06_models/lgbm

It is logged using log_model without any problems. Now, the next node is predict_lbgm which has lgbm as its input. To be able to load_model we need to pass a URI in one of the supported URI formats. The problematic part is that we don't have that URI at this point: we've got a path which is a path within artifact store for a given run.

I see two possible solutions:

We run save_model along with log_model so we can reuse the model in the next step. When load_model is run, we use local path from Data Catalog or model_uri if it's specified in load_args (something like model_uri=model_uri if model_uri else path).
We edit the Data Catalog during log_model and add load_args: model_uri: runs:/<mlflow_run_id>/path based on the current MLflow Run ID and the path already present in the Data Catalog entry. I'm not sure whether that's even possible as I haven't seen such option in Data Catalog API. It also "smells" like a bad practice.

Regarding the "mlflow" prefix

I thought about this parameter as a Python module that provides load/log functions (built-in MLflow or custom flavor). So it could be mlflow.pytorch module which has its functions to load/log PyTorch models or it could be my_Project.mlflow.pytorch_lightning in my project which specifies custom flavor with load/log functions.

Totally agree, I have the very same idea in mind. That said, Kedro has the following to instantiate the DataCatalog:

parse the config file to get dict datasets configuration
Create datasets with AbstractDataSet.from_config
call parse_dataset_definition https://github.com/quantumblacklabs/kedro/blob/a829d06cc224a454ec49ffa93682080a9bbc96bc/kedro/io/core.py#L167-L169
1. if it is a string, prefix it with potential default (kedro.io, kedro.extras.datasets) and try to load associated module
2. Else try to load directly the string as a module
3. If it is a valid module, download it (not useful when called within catalog.yml, but handy if you handle directly the DataSet object. https://github.com/quantumblacklabs/kedro/blob/a829d06cc224a454ec49ffa93682080a9bbc96bc/kedro/io/core.py#L416

I though we could append the <mlflow.> prefix the same way, but f you find it unecessarily complicated, keep the mlflow prefix I don't really mind.

Loading model from mlflow

I had this problem in mind. This is indeed a design pattern, and we can choose between 2 solutions:

Decide that load always load from mlflow and not from local, and that save uses mlflow.log_model. In this situation, your path argument is passed to the artifact_path argument of log_model and must be a relative remote path. This has the drawback of downloading the model locally for each load call.

The problematic part is that we don't have that URI at this point: we've got a path which is a path within artifact store for a given run.

AFAIK, you can reconstruct the model uri by retrieving the run object:

RUN_ID = self.load_args.get("run_id") or mlflow.active_run().info.run_id  # either run id is provided or use the current one during pipeline execution (you don't know the run_id beffore running the pipeline)
mlflow_client=MlflowClient()  # use internally mlflow.get_tracking_uri() previously set by the conf
run = mlflow_client.get_run(RUN_ID)
loaded_model = mlflow.xgboost.load_model(f"{run.info.artifact_uri}/{self.path}")  # will download the model locally on a temp path (self.path="remote/model/folder/architecture", likely just "model")

(I have just tried it out, it seems to work)

Use your solution 1., i.e. save the model locally and call log_model (seems an overkill here and generate extra code, but some people may fin it easier to store everything locally).

I've been writing the documentation for the connected PR #56 and I realized that there's a problematic case that needs some thought - the pyfunc flavor.

The challenge lies in passing either a python_module or a loader_module (depending on the custom pyfunc model workflow you choose) to save_model and log_model: those need to be class instances. We are obviously not able to do that through YAML alone.

There are two solutions that I see:

Passing a string (like my_package.mlflow_models.MyPyfuncModel or my_package.mlflow_models.MyPyfuncModel(arg1="values") if arguments need to be passed) and creating an instance of that class in MlflowModelDataSet._save if flavor == "mlflow.pyfunc".

It allows for a fully declarative approach but adds complexity to MlflowModelDataSet and can result in unforeseen consequences. Example entry in Data Catalog could look like this:

my_custom_model:
    type: kedro_mlflow.io.MlflowModelDataSet
    flavor: mlflow.pyfunc
    path: data/06_models/my_custom_model
    save_args:
        python_model: my_package.mlflow_models.MyPyfuncModel

Allowing to return a dict from a node with arguments for save_model/log_model. MlflowModelDataSet._save would need to check if the passed argument is a dict or not and if so then unpack that dict when passing to save_model/log_model.

This would be rather tricky because the path specified in Data Catalog entry could also be specified in the returned dict and we would need to do some magic to merge Data Catalog entry params with the dict.

@Galileo-Galilei, @akruszewski Let me know what you think guys. Maybe you have yet another idea. I'm leaning more towards the first solution as it seems less complicated and makes it clear to the user that the whole model specification is contained in the Data Catalog. In the case of the second solution merging save_args with dict returned from a node is going to be a pain.

Note: I've also checked other built-in flavors for arguments that are not basic data types supported by YAML and there are more like signature or input_example which makes it even more important and tricky issue that we need to carefully consider.

It needs some deeper thoughts to see if something better can come out, but the second solution is a no go for me (even if I understand that it should work):

Doing I/O operations (automagically, which is even worse) in kedro nodes completely break kedro's principles which is one of key idea of using this plugin.
it will make debugging very difficult
it is much more likely to be impacted by future kedro changes and to break some of the kedro features (namespaces, layers...)

On the other hand, the first solution seems reasonable (after all, it is what kedro itself does to instantiate the dataset). If arguments are needed, a custom init_args (or whatever the name is) should be used rather than passing the full string (my_package.mlflow_models.MyPyfuncModel(arg1="values") is not only ugly but breaks parameter versioning). I'll try to dive deeper to see advantages and shortcomings.

I totally agree with @Galileo-Galilei and your opinion @kaemo that the first option is a better solution. As @Galileo-Galilei mentions, it would be better to pass constructor arguments in a separate argument (init_args sounds reasonable).

Galileo-Galilei / kedro-mlflow

Create a MlflowModelDataSet #12

Regarding the "mlflow" prefix

Loading model from mlflow