bgruening / galaxytools

:microscope::books: Galaxy Tool wrappers
MIT License

signing models generated with Galaxy #1046

Open bgruening opened 4 years ago

bgruening commented 4 years ago
import hashlib
import hmac
import pickle

model = pickle.dumps(data)
# hmac.new() requires a bytes key, not a str
sent_digest = hmac.new(b'SHARED_KEY_MAYBE_FROM_ENV', model, hashlib.sha1).hexdigest()
rec_digest = hmac.new(b'SHARED_KEY_MAYBE_FROM_ENV', model, hashlib.sha1).hexdigest()
# compare_digest() avoids timing side channels
if not hmac.compare_digest(rec_digest, sent_digest):
    print('Integrity check failed')
else:
    unpickled_data = pickle.loads(model)

This way we could even share the SHARED_KEY_MAYBE_FROM_ENV across usegalaxy.* and transfer models.
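A sketch of how the shared key might be wired in, assuming each usegalaxy.* instance reads the same secret from an environment variable (the variable name `GALAXY_MODEL_HMAC_KEY` and the helper functions are hypothetical; SHA-256 is used here as the Python docs recommend it over SHA-1):

```python
import hashlib
import hmac
import os
import pickle

def sign(model_bytes: bytes, key: bytes) -> str:
    """Compute an HMAC digest over the pickled model bytes."""
    return hmac.new(key, model_bytes, hashlib.sha256).hexdigest()

def verify_and_load(model_bytes: bytes, digest: str, key: bytes):
    """Unpickle only if the digest matches (constant-time comparison)."""
    if not hmac.compare_digest(sign(model_bytes, key), digest):
        raise ValueError("Integrity check failed")
    return pickle.loads(model_bytes)

# Hypothetical: the shared secret distributed to every instance.
key = os.environ.get("GALAXY_MODEL_HMAC_KEY", "dev-only-key").encode()

model = pickle.dumps({"weights": [0.1, 0.2, 0.3]})  # stand-in model
digest = sign(model, key)
restored = verify_and_load(model, digest, key)
```

Tampered bytes or a wrong key make `verify_and_load` raise before `pickle.loads` is ever reached, which is the whole point of signing before unpickling.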

qiagu commented 3 years ago

Cool. That sounds workable. BTW, I would like to describe what future Galaxy ML models might look like in my mind. Models are saved in HDF5, including both model structure and weights. A new datatype, a subclass of HDF5, can be made specific to models. The datatype may carry a metadata file, similar to Bam, which permanently stores the model structure in JSON format. Consequently, models can be previewed by accessing the metadata file. Tunable hyperparameters can be saved in the metadata file as well, which enables dynamic access in the hyperparameter search tool. The current HDF5 persistence utility is built on top of pickle, so a sanitizer or security checker is still needed.
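A minimal sketch of the layout described above, using h5py: arrays live in HDF5 datasets, and the model structure is a JSON string stored in the file's attributes, so the model can be previewed without unpickling anything. The structure dict and file layout here are illustrative assumptions, not the Galaxy-ML format:

```python
import json
import os
import tempfile

import h5py
import numpy as np

# Hypothetical preview metadata: class name plus tunable hyperparameters.
structure = {
    "class": "KNeighborsClassifier",
    "params": {"n_neighbors": 100, "weights": "uniform"},
}
weights = np.arange(6.0).reshape(2, 3)  # stand-in for model arrays

path = os.path.join(tempfile.mkdtemp(), "model.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("weights", data=weights)
    # JSON metadata stored alongside the arrays for cheap preview.
    f.attrs["structure"] = json.dumps(structure)

with h5py.File(path, "r") as f:
    preview = json.loads(f.attrs["structure"])
    loaded = f["weights"][:]
```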

jgoecks commented 3 years ago

Agree this seems like the right approach for working with pickled files that we can't otherwise trust, @bgruening. It's also mentioned in the Python docs, so it's likely best practice: https://docs.python.org/3/library/pickle.html FYI, it looks like scikit-learn recommends a more efficient dump/load than pickle these days, called joblib: https://scikit-learn.org/stable/modules/model_persistence.html

Looking forward, I agree with @qiagu that moving to HDF5 files with model/pipeline attributes saved is the ideal approach. Unfortunately, there still doesn't seem to be general support for dumping/loading non-deep-learning ML models, and writing one ourselves is quite a bit of work that I don't think is worth it right now if we implement the HMAC signing approach.
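The joblib persistence mentioned above is a drop-in swap for pickle's dump/load; a minimal sketch (a plain dict stands in for a fitted estimator, and joblib still uses pickle under the hood, so the same trust and signing caveats apply):

```python
import os
import tempfile

from joblib import dump, load

# Stand-in for a fitted estimator; joblib shines when the object
# carries large NumPy arrays, but any picklable object works.
model = {"coef": [1.0, 2.0], "intercept": 0.5}

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(model, path)
restored = load(path)
```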

qiagu commented 3 years ago

@jgoecks We already have something there: https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml/model_persist

Loading model using pickle...
(1.2471628189086914 s)

Dumping model using pickle...
(3.6942389011383057 s)
File size: 930712861

Dumping model to hdf5...
(3.006715774536133 s)
File size: 930729696

Loading model from hdf5...
(0.6420958042144775 s)

Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)

https://github.com/goeckslab/Galaxy-ML/blob/master/galaxy_ml/tests/test_model_persist.py#L94-L101
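For reference, the pipeline printed above can be rebuilt like this (a sketch; every parameter not set here matches the scikit-learn defaults shown in that repr, and `make_pipeline` produces the same lowercase step names):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# RobustScaler with defaults, KNN with the non-default values
# visible in the printed repr (n_neighbors=100, n_jobs=1).
pipe = make_pipeline(
    RobustScaler(),
    KNeighborsClassifier(n_neighbors=100, n_jobs=1),
)
```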

jgoecks commented 3 years ago

Ah, good to know. How far along is this work?

qiagu commented 3 years ago

The utility was written earlier this year. I recall it worked for all sklearn models (and pipelines) in testing. It failed to handle one NumPy type from the xgboost module, but that can be fixed. Since the utility decomposes and recomposes objects through the same interface that pickle uses, I trust it.