PPPLDeepLearning / plasma-python

PPPL deep learning disruption prediction package
http://tigress-web.princeton.edu/~alexeys/docs-web/html/

Saving trained models and their metadata for inference and reproducibility #41

Open felker opened 4 years ago

felker commented 4 years ago

Following the discussion at the FRNN group meeting in San Diego on Wednesday 2019-12-04, we need to start systematically saving the best trained models for:

  1. Collaboration (no need for multiple users to waste GPU hours retraining the same models)
  2. Practical inference (@mdboyer wants a Python interface, derived from performance_analysis.py, that would let a user load a trained model and feed it a set of shots for inference, without the bloated shot-list and preprocessing pipeline that has been oriented towards training during the first phase of the project. This would enable exploratory studies of proximity to disruption, UQ, clustering, etc., and is an important intermediate step towards setting up the C-based real-time inference tool in the PCS.)
  3. Reproducibility
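To make point 2 concrete, a thin inference wrapper might look like the sketch below. Everything here is hypothetical (the `FRNNPredictor` name, the constructor arguments, the file layout); the trained Keras model is stood in for by any callable, e.g. `keras_model.predict`:

```python
import numpy as np

class FRNNPredictor:
    """Hypothetical thin wrapper: hold the normalization statistics and a
    trained model, then run inference on raw (unnormalized) shot signals."""

    def __init__(self, predict_fn, means, stds, signal_names):
        self.predict_fn = predict_fn          # e.g. keras_model.predict
        self.means = np.asarray(means)
        self.stds = np.asarray(stds)
        self.signal_names = list(signal_names)

    def predict_shot(self, raw_signals):
        # raw_signals: (timesteps, n_signals) array of unnormalized data
        x = (np.asarray(raw_signals) - self.means) / self.stds
        return self.predict_fn(x[np.newaxis, ...])  # add a batch dimension

# Usage with a dummy "model" (identity on the normalized input) and
# made-up signal names:
pred = FRNNPredictor(lambda x: x, means=[1.0, 2.0], stds=[2.0, 4.0],
                     signal_names=["q95", "li"])
out = pred.predict_shot(np.array([[3.0, 6.0], [5.0, 10.0]]))
```

The point of the sketch is that loading a saved model plus its normalization statistics is all a user should need, with no dependence on the training-time shot list or preprocessing pipeline.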

As part of a broader effort to improve the reproducibility of our workflow, these models should be stored together with their associated artifacts: the trained weights, the normalization parameters, the input signal names, and plain-text metadata about each trained model.

Given the binary .h5 and .npz files, we probably don't want to use VCS to store everything, but we might want to version control the plain-text metadata about the trained models. Should it live in this repository alongside the code, or in a new repository under our GitHub organization?

Also, should we consider ONNX?

mdboyer commented 4 years ago

Initially maybe archive both ONNX and h5 since we may use either for PCS deployment.
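For reference, one common route from a Keras .h5 file to ONNX is the `tf2onnx` converter (this assumes a TensorFlow-backed Keras model; the filenames are placeholders):

```shell
pip install tf2onnx
python -m tf2onnx.convert --keras trained_model.h5 --output trained_model.onnx
```

Archiving the converted .onnx next to the original .h5 would keep both deployment options open.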

I'd advocate saving normalization as txt/h5 instead of npz to facilitate reading by PCS.
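For instance, an existing .npz normalizer could be re-exported as whitespace-delimited text for easy parsing from C in the PCS. The field names (`means`, `stds`) and filenames here are hypothetical:

```python
import numpy as np

# Hypothetical normalization parameters as they might live in an .npz file.
np.savez("normalization.npz",
         means=np.array([1.0, 2.0]),
         stds=np.array([0.5, 4.0]))

# Re-export as plain text: one row per statistic, one column per signal.
stats = np.load("normalization.npz")
np.savetxt("normalization.txt", np.vstack([stats["means"], stats["stds"]]))

# Round-trip check that the text file preserves the values.
loaded = np.loadtxt("normalization.txt")
```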

Better yet, could the normalization just be added as a layer to the model post-training so it is saved in the ONNX/H5 file? This would make implementation of the inference even simpler since the unnormalized data could be used as input to the deployed model.
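If the first layer of the network is affine, the normalization doesn't even need to be a separate layer: it can be folded into the first layer's weights after training, so the exported model takes raw inputs with no runtime change at all. A numpy sketch of the algebra (toy weights, not from an actual FRNN model):

```python
import numpy as np

# Toy first-layer weights of a trained model that expects NORMALIZED input.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))     # (out_features, n_signals)
b = rng.standard_normal(4)
mean = np.array([1.0, -2.0, 0.5])   # per-signal normalization statistics
std = np.array([2.0, 0.5, 1.5])

def model_on_normalized(x_raw):
    """Original deployment path: normalize first, then apply the layer."""
    x = (x_raw - mean) / std
    return W @ x + b

# Fold the normalization into the layer: W' = W/std (per input column),
# b' = b - W' @ mean. Then W' @ x_raw + b' == W @ ((x_raw - mean)/std) + b.
W_folded = W / std
b_folded = b - W_folded @ mean

def model_on_raw(x_raw):
    """Deployed model: takes unnormalized data directly."""
    return W_folded @ x_raw + b_folded

x_raw = rng.standard_normal(3)
```

The same trick applies to the input weights of an LSTM or a 1-D convolution, since both are affine in the input; the saved ONNX/H5 file would then contain the folded weights and need no companion normalization file.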

Text files for signal names would also be easier to use in the PCS.

I would think having some example trained models in the main repo would be useful, but maybe a larger library of models could be maintained separately?