*Closed by @Galileo-Galilei 3 years ago*
Hi @Galileo-Galilei,
I'd like to work on this issue. I'm thinking that a Data Catalog entry could look like this:
```yaml
my_model:
  type: kedro_mlflow.io.MlflowModel
  flavor: mlflow.<flavor>
  path: data/06_models/my_model
  save_model_args:
    ...
  load_model_args:
    ...
  log_model_args:
    ...
```
What do you think?
The first argument to save/log is usually a model and the second is a path, so those would be populated automatically by the `MlflowModel` class. The only tricky parts would be the `tensorflow` flavor, which has a few more required arguments, and a custom `pyfunc` model. I need to work on concrete examples of both to come up with a solution to these challenges.
Hello @kaemo, nice to see you back there!
I totally agree, this feature is really missing from the package and I was going to address it, but please go ahead.
Some remarks / questions:

- I would name it `MlflowModelDataSet` for consistency with the kedro API.
- Why do we need `flavor: mlflow.<flavor>`? `flavor: <flavor>` seems sufficient, but I may be missing a point here.
- We should define precisely what the purpose of this class is. You need to supercharge the `save` and `load` methods of the `AbstractDataSet` class. These are the ones which will be called at the beginning/end of the nodes. You could define both `save_model` and `log_model` arguments, as well as an extra `log_model` method on the `AbstractDataSet`, but I think this will end up being confusing for the end user, because only the `save` method will be called during `Pipeline` execution. So you MUST define the default behaviour of your `save` method according to what your class is intended to do (that is, pick either `log_model` or `save_model`, but not both).

My guess is that end users expect `MlflowModelDataSet.save` to call `mlflow.log_model` rather than `mlflow.save_model`. `save_model` is more of a helper for extra control if you want to save your model locally (ignoring your `mlflow server` configuration). Most end users want to call `mlflow.log_model`, because it is the method dedicated to logging your model to your mlflow configuration. Therefore I suggest to:

- keep the Kedro `load_args` and `save_args` nomenclature for consistency
- pass the `load_args` to the `MlflowModelDataSet.load` method
- pass the `save_args` to the `MlflowModelDataSet.save` method, which calls `mlflow.log_model(**save_args)` internally

Your catalog entry would look like this:
```yaml
my_model:
  type: kedro_mlflow.io.MlflowModelDataSet  # add "DataSet" for consistency
  flavor: <flavor>  # remove "mlflow."
  path: data/06_models/my_model
  load_args:  # arguments to be passed to mlflow.<flavor>.load_model()
    ...
  save_args:  # arguments to be passed to mlflow.<flavor>.log_model()
    ...
  # remove the arguments to be passed to save_model
```
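A minimal sketch of such a dataset, assuming the names proposed above (this is an illustration of the design, not the actual `kedro-mlflow` implementation):

```python
import importlib


class MlflowModelDataSet:
    """Sketch of the proposed dataset. Names follow the catalog entry
    above but are assumptions, not the final implementation."""

    def __init__(self, flavor, path, load_args=None, save_args=None):
        self._flavor = flavor  # e.g. "sklearn" resolves to mlflow.sklearn
        self._path = path      # artifact path inside the active run
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    @property
    def _module(self):
        # resolved lazily so mlflow is only imported when actually used
        return importlib.import_module(f"mlflow.{self._flavor}")

    def _save(self, model):
        # default behaviour: log to the active mlflow run
        # (log_model rather than save_model, as suggested above)
        self._module.log_model(model, artifact_path=self._path, **self._save_args)

    def _load(self):
        return self._module.load_model(self._path, **self._load_args)
```

The `save_args`/`load_args` dicts are forwarded untouched to the flavor's `log_model`/`load_model`, which keeps the catalog entry declarative.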
What do you think?
> Hello @kaemo, nice to see you back there!
Happy to have more time to work on `kedro-mlflow`!
> I totally agree, this feature is really missing from the package and I was going to address it, but please go ahead. Some remarks / questions:
>
> - I would name it `MlflowModelDataSet` for consistency with the kedro API.
Agreed, I'll name it `MlflowModelDataSet`.
> - Why do we need `flavor: mlflow.<flavor>`? `flavor: <flavor>` seems sufficient, but I may be missing a point here.
I thought about this parameter as a Python module that provides load/log functions (built-in MLflow or a custom flavor). So it could be the `mlflow.pytorch` module, which has its functions to load/log PyTorch models, or it could be `my_Project.mlflow.pytorch_lightning` in my project, which specifies a custom flavor with load/log functions. That way there would be no confusion about whether it's the name of an MLflow flavor or a Python module path to a custom MLflow flavor, and it would be coherent. Let me know what you think.
> - We should define precisely what the purpose of this class is. You need to supercharge the `save` and `load` methods of the `AbstractDataSet` class. These are the ones which will be called at the beginning/end of the nodes. You could define both `save_model` and `log_model` arguments, as well as an extra `log_model` method on the `AbstractDataSet`, but I think this will end up being confusing for the end user, because only the `save` method will be called during `Pipeline` execution. So you MUST define the default behaviour of your `save` method according to what your class is intended to do (that is, pick either `log_model` or `save_model`, but not both).
>
> My guess is that end users expect `MlflowModelDataSet.save` to call `mlflow.log_model` rather than `mlflow.save_model`. `save_model` is more of a helper for extra control if you want to save your model locally (ignoring your `mlflow server` configuration). Most end users want to call `mlflow.log_model`, because it is the method dedicated to logging your model to your mlflow configuration. Therefore I suggest to:
> - keep the Kedro `load_args` and `save_args` nomenclature for consistency
> - pass the `load_args` to the `MlflowModelDataSet.load` method
> - pass the `save_args` to the `MlflowModelDataSet.save` method, which calls `mlflow.log_model(**save_args)` internally
>
> Your catalog entry would look like this:
>
> ```yaml
> my_model:
>   type: kedro_mlflow.io.MlflowModelDataSet  # add "DataSet" for consistency
>   flavor: <flavor>  # remove "mlflow."
>   path: data/06_models/my_model
>   load_args:  # arguments to be passed to mlflow.<flavor>.load_model()
>     ...
>   save_args:  # arguments to be passed to mlflow.<flavor>.log_model()
>     ...
>   # remove the arguments to be passed to save_model
> ```
>
> What do you think?
I agree, let's go with `log_model`. If someone wants to just save a model, then one of the Kedro built-in/custom DataSets should be used for serialization/deserialization.
@Galileo-Galilei, there's one thing that is problematic with the above approach. After `log_model` we are not able to `load_model` unless we specify a `model_uri`, which we don't have. Let me explain using an example.
Suppose we have a node `train_lgbm` that outputs a model for which we have the following entry in the Data Catalog:

```yaml
lgbm:
  type: kedro_mlflow.io.MlflowModelDataSet
  flavor: mlflow.lightgbm
  path: data/06_models/lgbm
```

It is logged using `log_model` without any problems. Now, the next node is `predict_lgbm`, which has `lgbm` as its input. To be able to `load_model` we need to pass a URI in one of the supported URI formats. The problematic part is that we don't have that URI at this point: we've got a path which is a path within the artifact store for a given run.
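For reference, the `runs:/` form of the model URI is just a run id plus the artifact path, so it could be assembled like this (the run id below is hypothetical, illustration only):

```python
run_id = "abc123def456"                # hypothetical MLflow run id
artifact_path = "data/06_models/lgbm"  # the catalog "path"

# one of the URI formats accepted by mlflow.<flavor>.load_model
model_uri = f"runs:/{run_id}/{artifact_path}"
print(model_uri)  # runs:/abc123def456/data/06_models/lgbm
```

The missing piece is the run id, which is only known once the pipeline is running.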
I see two possible solutions:

1. Call `save_model` along with `log_model` so we can reuse the model in the next step. When `load_model` is run, we use the local `path` from the Data Catalog, or `model_uri` if it's specified in `load_args` (something like `model_uri=model_uri if model_uri else path`).
2. Use only `log_model` and add `load_args: model_uri: runs:/<mlflow_run_id>/path` based on the current MLflow Run ID and the path already present in the Data Catalog entry. I'm not sure whether that's even possible, as I haven't seen such an option in the Data Catalog API. It also "smells" like a bad practice.

> I thought about this parameter as a Python module that provides load/log functions (built-in MLflow or a custom flavor). So it could be the `mlflow.pytorch` module, which has its functions to load/log PyTorch models, or it could be `my_Project.mlflow.pytorch_lightning` in my project, which specifies a custom flavor with load/log functions.
Totally agree, I have the very same idea in mind. That said, `Kedro` uses the following to instantiate the `DataCatalog`:

- `AbstractDataSet.from_config`
- `parse_dataset_definition`

https://github.com/quantumblacklabs/kedro/blob/a829d06cc224a454ec49ffa93682080a9bbc96bc/kedro/io/core.py#L167-L169

I thought we could append the `mlflow.` prefix the same way, but if you find it unnecessarily complicated, keep the mlflow prefix; I don't really mind.
I had this problem in mind. This is indeed a design decision, and we can choose between 2 solutions:

- decide that `load` always loads from mlflow and not from a local path, and that `save` uses `mlflow.log_model`. In this situation, your `path` argument is passed to the `artifact_path` argument of `log_model` and must be a relative remote path. This has the drawback of downloading the model locally for each `load` call.
> The problematic part is that we don't have that URI at this point: we've got a path which is a path within the artifact store for a given run.
AFAIK, you can reconstruct the model URI by retrieving the run object:

```python
import mlflow
from mlflow.tracking import MlflowClient

# either a run id is provided, or we use the current one during pipeline
# execution (you don't know the run_id before running the pipeline)
run_id = self.load_args.get("run_id") or mlflow.active_run().info.run_id
mlflow_client = MlflowClient()  # uses mlflow.get_tracking_uri() previously set by the conf
run = mlflow_client.get_run(run_id)
# downloads the model locally to a temp path
# (self.path = "remote/model/folder/architecture", likely just "model")
loaded_model = mlflow.xgboost.load_model(f"{run.info.artifact_uri}/{self.path}")
```

(I have just tried it out, and it seems to work.)
I've been writing the documentation for the connected PR #56 and I realized that there's a problematic case that needs some thought: the `pyfunc` flavor.

The challenge lies in passing either a `python_model` or a `loader_module` (depending on which custom `pyfunc` model workflow you choose) to `save_model` and `log_model`: in particular, `python_model` needs to be a class instance, which we are obviously not able to express through YAML alone.
There are two solutions that I see:

1. Specifying the class path in the Data Catalog (e.g. `my_package.mlflow_models.MyPyfuncModel`, or `my_package.mlflow_models.MyPyfuncModel(arg1="values")` if arguments need to be passed) and creating an instance of that class in `MlflowModelDataSet._save` if `flavor == "mlflow.pyfunc"`. It allows for a fully declarative approach but adds complexity to `MlflowModelDataSet` and can result in unforeseen consequences. An example entry in the Data Catalog could look like this:

   ```yaml
   my_custom_model:
     type: kedro_mlflow.io.MlflowModelDataSet
     flavor: mlflow.pyfunc
     path: data/06_models/my_custom_model
     save_args:
       python_model: my_package.mlflow_models.MyPyfuncModel
   ```

2. Having the node return a dict of arguments to be passed to `save_model`/`log_model`. `MlflowModelDataSet._save` would need to check whether the passed argument is a dict and, if so, unpack it when passing to `save_model`/`log_model`. This would be rather tricky, because the `path` specified in the Data Catalog entry could also be specified in the returned dict, and we would need to do some magic to merge the Data Catalog entry params with the dict.
@Galileo-Galilei, @akruszewski, let me know what you think, guys. Maybe you have yet another idea. I'm leaning towards the first solution, as it seems less complicated and makes it clear to the user that the whole model specification is contained in the Data Catalog. In the case of the second solution, merging `save_args` with the dict returned from a node is going to be a pain.
Note: I've also checked other built-in flavors for arguments that are not basic data types supported by YAML, and there are more, like `signature` or `input_example`, which makes this an even more important and tricky issue that we need to consider carefully.
It needs some deeper thought to see if something better can come out, but the second solution is a no-go for me (even if I understand that it should work).
On the other hand, the first solution seems reasonable (after all, it is what kedro itself does to instantiate the dataset). If arguments are needed, a custom `init_args` (or whatever the name is) should be used rather than passing the full string (`my_package.mlflow_models.MyPyfuncModel(arg1="values")` is not only ugly but breaks parameter versioning). I'll try to dive deeper to see the advantages and shortcomings.
I totally agree with @Galileo-Galilei and your opinion, @kaemo, that the first option is the better solution. As @Galileo-Galilei mentions, it would be better to pass constructor arguments in a separate argument (`init_args` sounds reasonable).
> The current version of `kedro-mlflow` only enables logging a full pipeline as an mlflow model through the `KedroPipelineModel` class. It would be useful to create an `AbstractDataSet` subclass in order to enable model logging within the `catalog.yml`.