Link trained models with training data

rogerkuou commented 11 months ago

In the daskml example and the dnn example, we showed two cases of ML training on split data (per grid cell). But for now it is not easy to connect the trained ML models back to the data partitions they were trained on.

A solution for now could be to save the spatio-temporal coordinates of the partition we used, and store this coordinate information as metadata along with the output model.

Todos:

Update the model export part of the two example notebooks:
- At export, write the spatio-temporal coordinates of the partition (as a JSON file?)
- Also write the path of the source data file (the zarr file) into the same (JSON) file
- In the dask ml notebook, try to implement the export with modelstore as suggested in #70. Example notebooks can be found here.
Update the usage page of daskml and dnn with code examples of writing this information.
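
A minimal sketch of the JSON idea from the todos above, using only the Python standard library. The helper name write_model_metadata, the coordinate keys, and all file names are hypothetical and only illustrate the shape of the metadata; they are not taken from motrainer.

```python
import json
import tempfile
from pathlib import Path


def write_model_metadata(path_model, coords, path_source):
    """Write a JSON sidecar file next to an exported model (hypothetical helper)."""
    # Same location and stem as the model file, but with a .json suffix.
    path_meta = Path(path_model).with_suffix(".json")
    metadata = {
        "source_data": str(path_source),  # path of the zarr file used for training
        "coordinates": coords,  # spatio-temporal coordinates of the partition
    }
    path_meta.write_text(json.dumps(metadata, indent=2))
    return path_meta


# Hypothetical partition: one grid cell and the time range used for training.
coords = {
    "latitude": 52.0,
    "longitude": 4.4,
    "time": ["2015-01-01", "2019-12-31"],
}
out_dir = Path(tempfile.mkdtemp())
path_meta = write_model_metadata(
    out_dir / "model_cell_0.h5", coords, "example_data.zarr"
)

# Reading the sidecar back recovers the link between model and training data.
restored = json.loads(path_meta.read_text())
```

Keeping the metadata in a sidecar file with the same stem as the model makes the pairing discoverable without opening the model file itself.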

rogerkuou commented 11 months ago

Hi @SarahAlidoost, this is the data model linking problem we talked about in the morning. Feel free to pick this up when you are available.

SarahAlidoost commented 11 months ago

I found that a Keras model can be saved/loaded in HDF5 format; h5py is one of the dependencies of Keras (see the Keras doc and TensorFlow doc). This way we can add metadata to the attributes of the HDF5 file when saving a Keras model. In the dnn.py module, we use self.model.save(path_model), and next to it the hyperparameters are saved in separate pickle files. However, with the HDF5 format, it is possible to save both the metadata of the training datasets and the hyperparameters as attributes in the same file.
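
A rough sketch of the attribute idea, assuming h5py is installed. To stay self-contained it does not call Keras: the file written below is only a stand-in for the output of self.model.save(path_model). The helper names, attribute keys, and values are hypothetical, and how Keras treats extra root attributes on load is not checked here.

```python
import json
import os
import tempfile

import h5py


def attach_metadata(path_model, metadata):
    """Store each metadata entry as a JSON-encoded root attribute (hypothetical helper)."""
    # Mode "a" opens the existing file for read/write without truncating it.
    with h5py.File(path_model, "a") as f:
        for key, value in metadata.items():
            f.attrs[key] = json.dumps(value)


def read_metadata(path_model, keys):
    """Read the JSON-encoded root attributes back into Python objects."""
    with h5py.File(path_model, "r") as f:
        return {key: json.loads(f.attrs[key]) for key in keys}


# Stand-in for a saved model file; in practice this would come from Keras.
path_model = os.path.join(tempfile.mkdtemp(), "model.h5")
with h5py.File(path_model, "w") as f:
    f.create_group("model_weights")

# Hypothetical metadata: training-data link and hyperparameters in one file.
metadata = {
    "source_data": "example_data.zarr",
    "coordinates": {"latitude": 52.0, "longitude": 4.4},
    "hyperparameters": {"learning_rate": 0.001, "batch_size": 32},
}
attach_metadata(path_model, metadata)
loaded = read_metadata(path_model, list(metadata))
```

Encoding each value as a JSON string keeps nested dictionaries intact, since HDF5 attributes themselves only hold scalars, strings, and arrays.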

See the draft implementation: https://github.com/VegeWaterDynamics/motrainer/pull/113

Link trained models with training data #104