Closed: andrewheusser closed this issue 7 years ago
I think hdf5 objects can be saved with h5py...
Another option would be .mat files, which can be saved and loaded using scipy.
(I don't think breaking compatibility is a good idea if we can avoid it)
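A quick sketch of both suggestions, assuming the array data is a plain numpy array (the filenames here are just placeholders):

```python
import numpy as np
import h5py
from scipy.io import savemat, loadmat

arr = np.random.rand(100, 3)  # stand-in for the geo's array data

# HDF5 via h5py: efficient for large numeric arrays
with h5py.File('geo.h5', 'w') as f:
    f.create_dataset('data', data=arr)
with h5py.File('geo.h5', 'r') as f:
    restored = f['data'][:]

# .mat via scipy: readable from MATLAB as well as Python
savemat('geo.mat', {'data': arr})
restored_mat = loadmat('geo.mat')['data']
```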
I don't think python class instances can be saved with hdf5... I believe they first have to be converted to a dictionary, but I could be wrong. I think this would be ok if we had a simple class instance to save, but the fact that the DataGeometry.reduce/align/normalize/cluster fields can be scikit-learn class instances and custom written functions (where we have no idea what kinds of data structures are being utilized) makes it tricky.
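To illustrate the dict-conversion idea for the simple case (a hypothetical class with only array-valued attributes; arbitrary scikit-learn estimators or user-written functions wouldn't flatten this cleanly):

```python
import numpy as np
import h5py

class SimpleGeo:
    """Hypothetical stand-in with only array-valued attributes."""
    def __init__(self, xform_data):
        self.xform_data = xform_data

geo = SimpleGeo(np.random.rand(10, 3))

# flatten the instance to a dict of arrays, then store each entry as a dataset
state = vars(geo)  # {'xform_data': array(...)}
with h5py.File('simple_geo.h5', 'w') as f:
    for key, value in state.items():
        f.create_dataset(key, data=value)

# rebuild on load by passing the datasets back to the constructor
with h5py.File('simple_geo.h5', 'r') as f:
    restored = SimpleGeo(**{key: f[key][:] for key in f.keys()})
```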
there is a library called deepdish that can help to convert class instances into dictionaries to then be saved in the HDF5 format. however, it looks like you have to build the class instance -> dictionary functions yourself, and we are supporting a lot of different classes (e.g. all the reduce models, cluster models, custom transforms).
http://deepdish.readthedocs.io/en/latest/io.html#class-instances
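For plain dict/array data, deepdish's save/load is a one-liner (a sketch; the class-instance support linked above still requires writing the conversion methods per class):

```python
import numpy as np
import deepdish as dd

# dicts of arrays and scalars round-trip directly to HDF5
payload = {'data': np.random.rand(100, 3), 'n_components': 3}
dd.io.save('payload.h5', payload)
restored = dd.io.load('payload.h5')
```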
here's another possibly useful solution: jsonpickle (https://github.com/jsonpickle/jsonpickle), which converts python objects to json.
^ the advantage of converting to json is that it is not only (python) cross-version compatible, but many other languages can handle json.
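A sketch of the jsonpickle round-trip; the numpy extension registration is needed because fitted scikit-learn models carry numpy arrays, and `PCA` here is just an example model, not necessarily what we'd save:

```python
import numpy as np
import jsonpickle
import jsonpickle.ext.numpy as jsonpickle_numpy
from sklearn.decomposition import PCA

jsonpickle_numpy.register_handlers()  # teach jsonpickle about numpy arrays

model = PCA(n_components=2).fit(np.random.rand(50, 5))

frozen = jsonpickle.encode(model)   # plain JSON string
thawed = jsonpickle.decode(frozen)  # a working PCA instance again
print(thawed.transform(np.random.rand(1, 5)))
```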
went with HDF5, closing this issue!
After performing an analysis and visualizing the result, we want to save out the `geo` so that it can be shared or loaded in at a later time. After a little research, here are a few options:

- `pickle` - this is the simplest way to save out an object. the downside is that it's not an efficient way to store large arrays of data, and pickles created in one version of python (2/3) can be problematic to load in the other. It is possible to save out versions that are compatible with each version separately (i.e. 1 file for python 2 and one for python 3). a good resource
- `joblib` - this library appears to wrap pickle, but is more efficient at handling large array data. you can also easily compress files. the downside is that it suffers from the same cross-version incompatibility issues as `pickle`
- `json` - it's possible to manually turn objects into json format, and then rebuild the objects on reload. I don't think this is a great solution for us given how variable our saved files may be (e.g. 1 or more of 20+ different scikit-learn model objects). (see here for a post about converting scikit-learn objects to json)
- `h5` - this is an efficient file format for large amounts of array data. however, as far as I can tell, python objects can not be easily saved.
- `h5` + `pickle`/`joblib` - one possibility would be to save the array data in the h5 format and the rest in a pickle. we would get the benefit of storing array data with h5, and the ease of storing object data with pickle (see the sketch after the summary below).

To summarize, I don't see an elegant way to solve the cross-version (python 2/3) saving issue. So, unless we convert all the models to json and then rebuild them, we are stuck with pickle. My choice would be to go with joblib, which is like pickle but more efficient at handling large array data, and just note that you can't create a file in one version of python and load it in the other.
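A minimal sketch of the hybrid `h5` + `pickle` idea, assuming a hypothetical `Geo` class whose `data` attribute holds a numpy array and whose other fields are arbitrary python objects:

```python
import pickle
import numpy as np
import h5py

# hypothetical stand-in for the geo object described above
class Geo:
    def __init__(self, data, reduce_model):
        self.data = data                  # large numeric array -> h5
        self.reduce_model = reduce_model  # arbitrary object -> pickle

geo = Geo(np.random.rand(1000, 3), {'model': 'PCA', 'params': {'n_components': 3}})

# arrays go to h5, everything else goes to a pickle alongside it
with h5py.File('geo.h5', 'w') as f:
    f.create_dataset('data', data=geo.data)
with open('geo.pkl', 'wb') as f:
    pickle.dump({'reduce_model': geo.reduce_model}, f)

# reload both files and reassemble the object
with h5py.File('geo.h5', 'r') as f:
    data = f['data'][:]
with open('geo.pkl', 'rb') as f:
    rest = pickle.load(f)
geo2 = Geo(data, rest['reduce_model'])
```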
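And the joblib route recommended in the summary, sketched with a fitted model standing in for the geo (joblib's `compress` flag covers the easy-compression point above):

```python
import numpy as np
from joblib import dump, load
from sklearn.decomposition import PCA

# any picklable object works; a fitted model is a stand-in for the geo here
model = PCA(n_components=2).fit(np.random.rand(50, 5))

dump(model, 'geo.joblib', compress=3)  # compressed single-file save
restored = load('geo.joblib')          # caveat: load under the same python major version
```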