bgruening opened this issue 4 years ago
Cool. That sounds workable. BTW, I would like to describe what future Galaxy ML models look like in my mind. Models are saved in HDF5, including both the model structure and the weights. A new datatype, subclassing HDF5, can be made specifically for models. The datatype may carry a metadata file, similar to Bam, which permanently stores a JSON representation of the model structure. Consequently, models can be previewed by accessing the metadata file. Tunable hyperparameters can be saved in the metadata file as well, which enables dynamic access in the hyperparameter search tool. The current HDF5 persistence utility is built on top of pickle, so a sanitizer or security checker is still needed.
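Just to make the idea concrete, here is a minimal sketch (not the actual Galaxy datatype, names are illustrative) of keeping a JSON description of the model structure and its tunable hyperparameters next to the weights inside the HDF5 file, so the model can be previewed without unpickling anything:

```python
# Hedged sketch: JSON "metadata" stored alongside the weights in HDF5.
import json
import h5py
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=100)

with h5py.File("model.h5", "w") as f:
    # Attribute name is illustrative; class name + hyperparameters as JSON.
    f.attrs["model_structure"] = json.dumps({
        "class": type(model).__name__,
        "params": model.get_params(),
    })
    # ...the fitted state / weights would be written here by the persistence util.

# Previewing or hyperparameter discovery only reads the metadata.
with h5py.File("model.h5", "r") as f:
    structure = json.loads(f.attrs["model_structure"])
    print(structure["class"], structure["params"]["n_neighbors"])
```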
Agree this seems like the right approach for working with pickled files that we can't otherwise trust @bgruening; it's also mentioned in the Python docs, so it's likely best practice: https://docs.python.org/3/library/pickle.html. FYI, scikit-learn recommends joblib for model persistence these days, which is more efficient than plain pickle for objects carrying large numpy arrays: https://scikit-learn.org/stable/modules/model_persistence.html
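For reference, the joblib usage is just a drop-in dump/load; note it is still pickle-based underneath, so the same trust problem applies to untrusted files:

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

dump(clf, "model.joblib")   # efficient for estimators with large numpy arrays
clf = load("model.joblib")  # still executes arbitrary code if the file is untrusted
```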
Looking forward, I agree with @qiagu that moving to HDF5 files with model/pipeline attributes saved is the ideal approach. Unfortunately, there still doesn't seem to be general support for dumping/loading non-deep-learning ML models, and writing that ourselves is quite a bit of work that I don't think is worth it right now if we implement the HMAC signing approach.
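A minimal sketch of the HMAC signing approach, assuming a secret key shared only by trusted Galaxy instances (key handling and function names are illustrative, not an actual implementation):

```python
import hashlib
import hmac
import pickle

from sklearn.neighbors import KNeighborsClassifier

def sign(data: bytes, key: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

def verified_loads(data: bytes, signature: bytes, key: bytes):
    # Refuse to unpickle anything whose signature does not match.
    if not hmac.compare_digest(sign(data, key), signature):
        raise ValueError("model payload failed HMAC verification")
    return pickle.loads(data)

key = b"example-secret"                        # would come from instance config
payload = pickle.dumps(KNeighborsClassifier(n_neighbors=100))
signature = sign(payload, key)                 # stored/transferred with the payload
model = verified_loads(payload, signature, key)
```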
@jgoecks We already have something for this: https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml/model_persist
```
Loading model using pickle...
(1.2471628189086914 s)
Dumping model using pickle...
(3.6942389011383057 s)
File size: 930712861
Dumping model to hdf5...
(3.006715774536133 s)
File size: 930729696
Loading model from hdf5...
(0.6420958042144775 s)
Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)
```
https://github.com/goeckslab/Galaxy-ML/blob/master/galaxy_ml/tests/test_model_persist.py#L94-L101
Ah, good to know. How far along is this work?
The util was made earlier this year. I recall it worked for all sklearn models (pipelines) in a test. It failed to catch one numpy type from the xgboost module, but that can be fixed. Since the util decomposes and recomposes objects through the same interface that pickle uses, it has my trust.
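To illustrate what I mean by "the same interface as pickle" (this is just the generic pickle protocol, not the Galaxy-ML code itself): an estimator can be decomposed into a plain state dict and recomposed from it, and presumably the util stores something like that state in HDF5 instead of a pickle byte stream.

```python
# Generic illustration of decompose/recompose via the pickle protocol hooks.
from sklearn.neighbors import KNeighborsClassifier

est = KNeighborsClassifier(n_neighbors=100)
state = est.__getstate__()                      # plain dict of attributes
rebuilt = KNeighborsClassifier.__new__(KNeighborsClassifier)
rebuilt.__setstate__(state)                     # recompose without __init__
print(rebuilt.get_params()["n_neighbors"])      # 100
```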
This way we could even share the `SHARED_KEY_MAYBE_FROM_ENV` across usegalaxy.* and transfer models.
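Something like the following, with the variable name taken from the comment above as a placeholder: if every usegalaxy.* server exports the same secret, a model signed on one instance verifies on another with the `sign()`/`verified_loads()` sketch earlier in the thread.

```python
import os

# Hypothetical environment variable shared across trusted instances.
key = os.environ["SHARED_KEY_MAYBE_FROM_ENV"].encode()
```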