Closed Galileo-Galilei closed 3 years ago
@Galileo-Galilei did you do some work on it? I would be happy to take it over and provide a solution proposition as PR. (sorry for interrupting your holidays...)
Hello @akruszewski, thank you so much for taking this one. I have some general design decisions about the Mlflow DataSets in mind that I want to discuss with you.

I wish all DataSets implemented in `kedro-mlflow` respected the following principles:

- **Consistency with Kedro and mlflow:** the more the plugin behaves like vanilla Kedro and mlflow, the less documentation we need. If the DataSets' `save` and `load` methods behave in the most expected way, users can simply refer to the Kedro and mlflow documentation and transition to the plugin smoothly. It also helps users who currently use Kedro and mlflow together, with multiple calls to mlflow logging functions inside their nodes, to move to the plugin.
- **Consistency of arguments:** e.g. the arguments of `save_csv` must be consistent with the ones in Kedro (exactly the same arguments, with the same names).
- **Minimal functionality:** the more functionality we add to a `DataSet` (e.g., performing I/O on the disk), the more maintenance cost we will have, and it is an exponential burden because we will have to keep consistency across our different DataSets. See the numerous PRs in Kedro for DataSets to see that it is an entire system to maintain. In general, it is easier to add a functionality than to suppress one, so I want to be sure that any extra functionality does not already have a clear way to be done. For instance, the `save_csv` and `save_json` methods do not enable opening the file in append mode, specifying the encoding, saving in a pickle format... Each of these would raise new PRs with documentation updates and consistency to ensure. How far do we want to go? If `MlflowMetricDataSet` only calls `log_metric` but you absolutely need to save the metric on disk too, you can either use transcoding or switch your tracking uri to a local path. In my opinion, it is really unnecessary to try to manage this I/O operation ourselves, since it is very easy for a user to add it separately if wanted.

In my opinion, we should implement TWO datasets and not only one (consistency with the mlflow principle):
- `MlflowMetricDataSet`: deals with either a float, a list of floats, or a dict of `{step: float, value: float}`. The ability to specify a step for metrics was introduced in mlflow a long time ago. It feels natural when you log a metric at each iteration step (for instance, because scoring your test data at each epoch is slow for a big neural network, you may want to evaluate the metric only every X epochs). This is not a big feature, but I think it is better to be consistent with mlflow.
- `MlflowMetricsDataSet` (with an "s"): a dict of `{string: one of the 3 value types above}`; same as above, but logs all the metrics in the dict (you already enable this in your current PR, I just think it is better to separate the two datasets for consistency).

For the `MlflowMetricDataSet` (`MlflowMetricsDataSet` is completely similar), it would go like this:
- `_save`:
  - if `self.save_args["run_id"]` is not None, call `log_metric` on the given run_id if it exists (with a different behaviour depending on whether the value is a float, a list of floats or a dict of `{step, value}`)
  - else, if `mlflow.active_run()` returns a run, call `log_metric` on the current run
  - store the value in `self._data` (as for `MemoryDataSet`) to enable further loading without querying
- `_load`:
  - if `self.load_args["run_id"]` is not None, use `MlflowClient()` to load the metric from the specified run_id
  - else, if `self._data` is not None, use the current in-memory value (I put this condition second because I think it is more coherent to query again if the run_id has changed in interactive mode)

I have seen that you created a `prefix` attribute in order to distinguish between the datasets you calculated the metric on. It makes sense to me to have such a distinction, but the `prefix` might lead to further unattended complications. I would rather have a `key` attribute equal to the name of the dataset in the `catalog.yml` file. It would introduce more consistency between the mlflow database and the `catalog.yml`. It would also make it easier to `pull` a run in the future (by creating a catalog from a mlflow run, for instance, if that makes sense). I do not see a case where the `key` attribute is not enough and you need the extra `prefix` attribute: either you apply the same pipeline on a different dataset, and I guess you record extra information (date, number of rows, of columns, ...) on the data inside the mlflow run to distinguish easily between runs; or you calculate the same metric on two datasets inside the same run, and then they correspond to different entries in your catalog, so they have different names / keys. Do you have any idea on how we could enforce that?
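The `_save`/`_load` behaviour described above could be sketched like this. This is a rough sketch only: the class name is made up, `client` stands in for the mlflow fluent API / `MlflowClient`, and the float / list / dict dispatch is elided.

```python
# Rough sketch of the proposed control flow; not the real implementation.
# `client` is assumed to expose log_metric / active_run / get_metric.

class MlflowMetricDataSetSketch:
    def __init__(self, key, client, load_args=None, save_args=None):
        self.key = key
        self.client = client
        self.load_args = load_args or {}
        self.save_args = save_args or {}
        self._data = None  # in-memory copy, as for MemoryDataSet

    def _save(self, value):
        run_id = self.save_args.get("run_id")
        if run_id is not None:
            # log to the explicitly requested run
            self.client.log_metric(run_id, self.key, value)
        elif self.client.active_run() is not None:
            # no run_id given: log to the currently active run
            self.client.log_metric(self.client.active_run(), self.key, value)
        self._data = value  # enable further loading without querying

    def _load(self):
        run_id = self.load_args.get("run_id")
        if run_id is not None:
            # query first: the run_id may have changed in interactive mode
            return self.client.get_metric(run_id, self.key)
        if self._data is not None:
            return self._data
        raise ValueError(f"No value to load for metric '{self.key}'")
```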
@kaemo @akruszewski Does it make sense to you? Do you have strong evidence against this implementation? (Notice that it also has implications for how we should handle #12, but I will write another post about this tomorrow.) I am completely open to discussion.
@Galileo-Galilei I agree with basically everything, but there's one use case that needs to be solved which your proposed solution does not address.
When there are two pipelines (say training and prediction), the first one produces a dataset/model/metric, and the second one is executed as a separate run and depends on one of those artifacts (in a broad sense), there is no way to run such pipelines one after the other, because you would need to specify `run_id` in `load_args` manually every time you run the training pipeline. This is a no-go for me because it excludes one of the most basic workflows. Pure Kedro solves this by loading the given path & latest version from the disk. With Kedro-MLflow I currently cannot run two pipelines with such dependencies one after the other in an automatic way.
I see two possible solutions:
1. Search through the runs using `mlflow.search_runs`, ordered by run date in descending fashion, until we find a run with an artifact with a given name/path stored in it, and load it from there. This can lead to long search and load times if you have many experiments; plus, if two different entries in the Data Catalog share the same artifact path, we have an issue.

Let me know how you think this can be solved. Manually specifying a `run_id` to load from is not a solution for me. Also, what if one was given, say, an MLflow model from another project and needed to load it from a local path? There is no way to do that. One would need to create an artificial workflow to put it in MLflow first, get the `run_id`, specify it in the Data Catalog and then load it from there.
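The search-based fallback could look roughly like this. It is a sketch only: the helper name is made up, and `client` is any object exposing `MlflowClient`-style `search_runs` / `list_artifacts` methods (with the real client, the run id lives at `run.info.run_id` and artifacts are `FileInfo` objects with a `.path`).

```python
def find_latest_run_with_artifact(client, experiment_ids, artifact_path):
    """Scan runs from newest to oldest; return the id of the first run
    containing `artifact_path`, or None if no run has it.

    As noted above, this scan gets slow with many runs, and is ambiguous
    when two catalog entries share the same artifact path.
    """
    runs = client.search_runs(experiment_ids,
                              order_by=["attributes.start_time DESC"])
    for run in runs:
        paths = {info.path for info in client.list_artifacts(run.info.run_id)}
        if artifact_path in paths:
            return run.info.run_id
    return None
```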
I totally agree, and that's part of what I had in mind when I wrote:

> Notice that it also has implications for how we should handle #12, but I will have another post tomorrow about this

but I did not write this post fast enough ;)
First of all, I think that the workflow you describe (two separate pipelines, one for training, one for prediction) only concerns artifacts (and models, which are special artifacts), but neither parameters nor metrics. I don't have any common use case in which you may want to retrieve parameters/metrics in another run: params which need to be reused for prediction are always stored in an object which will be stored as an artifact, and metrics are "terminal" objects: another pipeline will likely use other data and calculate other metrics.
The point you are describing is one of the major disagreements between the data scientists and the data engineers at work (they do not use this plugin but a custom-made version, which does not matter here). The data scientists want to perform the operations you describe (load the latest version on disk without providing a run_id, reuse a model a coworker copy/pasted locally), while the data engineers want this operation (providing the run_id) to be manual, with the artifacts downloaded from mlflow as the single source of truth, because when they deploy in production they want an extra check after training the model. The data engineers insist that manually providing the run_id is the responsibility of the "ops guy". They really stand against just using "the last version", to avoid operational risk.
The consensus we reached is to force providing the run id as a global parameter when running the pipeline (we use the `TemplatedConfigLoader` class to provide the mlflow run_id at runtime): `kedro run --pipeline=prediction --run-id=RUN_ID`. This constrains exploration, but facilitates deployment (for exploration you don't have to modify the catalog since you can specify the id at runtime, so it is easier than modifying the catalog each time as you suggest).
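For illustration, in our custom setup the templated catalog entry looks roughly like this (a sketch: the entry name, the `${mlflow_run_id}` variable and the `load_args` shape are our own conventions, not the plugin's actual API):

```yaml
# catalog.yml (sketch): the run id is not hardcoded; TemplatedConfigLoader
# substitutes ${mlflow_run_id} with the value passed on the command line.
trained_model:
  type: kedro_mlflow.io.MlflowDataSet
  load_args:
    run_id: ${mlflow_run_id}
```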
I don't feel this is the right solution for us though, because the plugin would not be self-contained, and it would imply messing with the project template, which is a moving part. It would largely hurt the portability and ease of use of the plugin.
For metrics: `MlflowMetricDataSet` and `MlflowMetricsDataSet`; I think the above suggestions are still valid.

For models, we should have two datasets:

- a `MlflowLocalModelDataSet` (not sure about the name, but I want to distinguish it from the datasets that log in mlflow) whose `save` method calls `mlflow.save_model` on the disk and whose `load` method loads from the disk (no logging involved here)
- a `MlflowModelDataSet` that performs both saving and logging, but we still have to define its `load` method and it will likely be redundant with other implementations (see further)

For artifacts, we should change the current `MlflowDataSet`:

- rename it `MlflowArtifactDataSet` for consistency with the others
- in its `load` method: if `self.load_args["run_id"]` is provided, load from mlflow; else load from the local `self.filepath`; else fail.

This implementation would also enable the following entry in the catalog:
```yaml
my_model:
  type: kedro_mlflow.io.MlflowArtifactDataSet
  data_set:
    type: kedro_mlflow.io.MlflowLocalModelDataSet  # or any valid kedro DataSet
    filepath: /path/to/a/LOCAL/destination/folder  # must be a local folder
```
so it would make point 2.ii irrelevant, since it would be completely redundant with the above entry.
Would it match all your needs if we do it this way ?
Hi @Galileo-Galilei, thanks for your review! I think that I implemented most of the things, but I also have a few topics to cover in the discussion. If I omitted something, please point it out here.
I pushed a branch today with the second implementation for this issue. In this one, `MlflowMetricsDataSet`:

- uses `load_args` and `save_args` arguments (for now just one load/save arg is used: `run_id`); I'm still thinking that there should be a possibility to pass `run_id` once, as an argument to the constructor.
- can take metrics as a `dict` with `float`s, lists of `float`s, a `dict` with `value` and `step` keys, or a `list` of `dict`s with `value` and `step` keys (we should consider adding a `timestamp` key, as this is the third argument to `log_metric`).

As you mentioned in the comment to my original PR, we should limit side effects to a bare minimum. As logging to MLflow is our main task (and a side effect in terms of function purity), we should probably avoid putting the value in the `_data` instance attribute.
The second argument would be that `MlflowMetricsDataSet` is not an in-memory dataset, but rather a persistent one.
Of course, if you think it is still better to have the `_data` instance attribute, I will add it, and I will be happy to discuss it further.
Why the `prefix` attribute? I will describe a scenario where it is useful. Let's assume that we are training two models. For both of them, we have just one reusable node which evaluates models and returns one metric: `accuracy`. How do we distinguish them? The only way (in my opinion; I would be happy to hear about different solutions because, to be honest, I don't like mine) is to define a prefix. The best scenario would of course be to take the dataset name from the Kedro catalog, but I didn't find a way to do that (and I believe there is none).
@Galileo-Galilei I forgot to mention that, in my opinion, there is no point in creating the second dataset `MlflowMetricDataset`, as it would be almost identical to this one, and there is no semantic difference between logging one or multiple metrics. You can use the same dataset for both purposes (just as you do with CSV files, no matter whether they have one or multiple rows).
PR: https://github.com/Galileo-Galilei/kedro-mlflow/pull/49
@kaemo @Galileo-Galilei I also have an idea, let me know what you think about it. If you can find the time, I would be happy to have a live session (chat/video chat/another live communication channel) where we could discuss this topic.
Hello, I agree on almost everything.
Some comments:
> uses `load_args` and `save_args` arguments (for now just one load/save arg is used: `run_id`); I'm still thinking that there should be a possibility to pass `run_id` once, as an argument to the constructor.
The more I think about it, the more I agree with you. My first idea was to enable loading from one run and logging in another, because some data scientists do this manually for some artifacts/models (as @kaemo suggested above, they share models locally during the experimentation phase even if it sounds like bad practice for further productionizing). However, `run_id` plays the same role here as the `filepath`, and it would be much more consistent with Kedro to pass it to the constructor. Conclusion: let's pass `run_id` to the constructor.
> (actually I will change it, it shouldn't fail, but it should create a new run with `log_metric`, as this is the default behavior of MLflow)
Agreed, let's keep the mlflow behaviour, even if I don't like it and, like you, think it should rather fail. It should not have any impact when running a pipeline from the command line (because hooks properly manage run opening and closing), but it will change the behaviour in interactive mode.
> It can take metrics as dict with floats, lists of floats, dict with value and step keys, or list with dicts with value and step keys.
It should also handle a plain `float`, shouldn't it?
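To make the accepted shapes concrete (including a bare float), a normalization step could reduce each metric value to `(value, step)` pairs before the `log_metric` calls. This is only an illustrative helper; the name and the default-step convention are made up:

```python
def normalize_metric(value):
    """Reduce a metric value (bare float, list of floats, {"value", "step"}
    dict, or list of such dicts) to a list of (value, step) pairs.
    Illustrative sketch only."""
    if isinstance(value, (int, float)):
        return [(float(value), 0)]                # bare float -> single step 0
    if isinstance(value, dict):                   # {"value": ..., "step": ...}
        return [(float(value["value"]), value.get("step", 0))]
    pairs = []
    for step, item in enumerate(value):           # list of floats or of dicts
        if isinstance(item, dict):
            pairs.append((float(item["value"]), item.get("step", step)))
        else:
            pairs.append((float(item), step))
    return pairs
```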
> (we should consider adding timestamp key, as this is the third argument to log_metric)
I wish we could, but according to the mlflow documentation I don't think we can pass the timestamp as an argument to `log_metric`, unfortunately.
> we should limit side-effects to bare minimum [...] we probably should avoid putting it in the `_data` instance attribute.
Agreed. My idea here was to avoid an extra HTTP connection when loading from a remote database, but it is really not a big issue, and avoiding side effects is more important to me.
> For both of them, we have just one reusable node which evaluates models, which returns one metric: accuracy

Yes, I totally understand that.
> The best scenario would of course be to take the dataset name from the Kedro catalog, but I didn't find a way to do that (and I believe there is none).
I totally agree that it would be much better to retrieve the name from the DataCatalog. I think we can achieve it the following way:

- remove the `prefix` and keep a `key` attribute;
- add an `after_catalog_created` method in the `MlflowPipelineHook` that modifies the `MlflowMetricsDataSets` on the fly, setting each name in the catalog as the `key` attribute of the corresponding `MlflowMetricsDataSet` whenever the `key` attribute is not provided;
- let users override this default by providing their own `key` attributes.

I think that having automatic consistency with the DataCatalog is a fair compensation for the additional complexity/side effect introduced by such an implementation.
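The hook idea could be sketched as follows. This is a sketch under assumptions: the core logic is pulled out into a plain function, `catalog._data_sets` is Kedro's internal name-to-dataset mapping at the time of writing, and the hook wiring is only indicated in a comment:

```python
# Sketch: default each MlflowMetricsDataSet's `key` to its catalog name.

def assign_default_keys(datasets, is_metrics_dataset):
    """datasets: mapping of catalog name -> dataset instance.
    is_metrics_dataset: predicate identifying the metrics datasets."""
    for name, dataset in datasets.items():
        if is_metrics_dataset(dataset) and getattr(dataset, "key", None) is None:
            dataset.key = name  # user-provided keys are left untouched

# Inside MlflowPipelineHook it would be wired roughly as:
#
#     @hook_impl
#     def after_catalog_created(self, catalog, **kwargs):
#         assign_default_keys(
#             catalog._data_sets,
#             lambda ds: isinstance(ds, MlflowMetricsDataSet),
#         )
```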
> there is no point in doing the second dataset MlflowMetricDataset, as it would be almost identical to this one, and there is no semantic differences in log one or multiple metrics.
Agreed, it would introduce too much code redundancy for very little additional gain.
P.S.: The call is a very good idea. I've sent you a LinkedIn invitation so we can exchange our contact details privately.
@kaemo,
> Search through the runs using mlflow.search_runs ordered by run date in descending fashion until we find a run with artifact with a given name/path stored in that run and load it from there.
I forgot to write it, but using the most recent runs for loading is completely out of the possible solutions. Indeed, I've learnt that some teams use a common mlflow server for all data scientists (unlike my team, where all data scientists have their own that they can handle as they want, plus a shared one for sharing models, where training is triggered by CI/CD). This leads to conflicting write issues (several runs can be launched by different data scientists at the same time). I feel that it is a very bad decision (and they complain that their mlflow is a total mess), but it is still what they use right now, and we cannot exclude the possibility that even for my team the shared mlflow can have conflicts if several runs are launched concurrently (especially when models take long to train, e.g. deep learning models).
**Context**

As of today, `kedro-mlflow` offers a clear way to log parameters (through a `Hook`) and artifacts (through the `MlflowArtifactDataSet` class in the `catalog.yml`). However, there is no well-defined way to log metrics automatically in mlflow within the plugin. The user still has to log the metrics directly by calling `log_metric` within their self-defined functions. This is not very convenient nor parametrizable, and makes the code less portable and messier.

**Feature description**

Provide a unique and well-defined way to log metrics through the plugin.
**Possible Implementation**

The easiest implementation would be to create a `MlflowMetricDataSet` very similar to `MlflowArtifactDataSet`, to enable logging the metric directly in the `catalog.yml`. The main problem with this approach is that some metrics evolve over time, and we would like to log the metric on each update. This is not possible here, because the updates are made inside the node (while it is running), and not at the end.