datreant / datreant.data

convenient data storage and retrieval in HDF5 for Treants
http://datreant.org/
BSD 3-Clause "New" or "Revised" License

Problem when saving a dataframe that contains pyobjects #18

Open gabrielelanaro opened 7 years ago

gabrielelanaro commented 7 years ago

I'm trying to save a dataframe that contains a "series of lists" (they correspond to ionic clusters), however there is a problem with the serialization:

t = dtr.Treant('/tmp/hello')
t.data['hello'] = pd.DataFrame({ 'lists': [[0, 1, 2], [0, 1], [10, 22]] })

TypeError: Cannot serialize the column [lists] because
its data contents are [mixed] object dtype

I found that, for dataframes, the msgpack format is pretty robust and efficient. Maybe we could serialize dataframes using that?

It would, however, hurt backward compatibility.

dotsdl commented 7 years ago

@gabrielelanaro this kind of DataFrame is not a good candidate for storage in HDF5 (as you found), but you could store it using datreant.data as a Python object by wrapping it in something that will trigger storage as a pickle. For example, you could do:

t = dtr.Treant('/tmp/hello')
t.data['hello'] = (pd.DataFrame({ 'lists': [[0, 1, 2], [0, 1], [10, 22]] }),)

which makes the stored object a tuple, so it gets pickled instead of being crammed into an HDF5 file.
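To show what the tuple wrapper amounts to, here is a minimal sketch that uses only pandas and the standard-library pickle module, not datreant.data itself. It assumes the pickle path simply serializes the tuple as given; the variable names are illustrative, and how datreant.data lays the file out on disk is not shown.

```python
import pickle

import pandas as pd

# The same DataFrame as in the report: a column of Python lists (object dtype).
df = pd.DataFrame({"lists": [[0, 1, 2], [0, 1], [10, 22]]})

# Wrap it in a one-element tuple, mirroring the suggestion above,
# and serialize the whole tuple with pickle.
wrapped = (df,)
blob = pickle.dumps(wrapped)

# On the way back, unwrap the single element to recover the DataFrame.
(restored,) = pickle.loads(blob)
```

The ragged lists survive the round trip because pickle stores arbitrary Python objects and does not need a fixed column type.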

I realize pickle is a poor format for data curation (it is not entirely safe, since deserialized objects can do nefarious things, and it is not robust across Python versions, etc.), but it is the lowest common denominator. We could consider using msgpack instead, since it's often used as a substitute for pickle, but I'm not familiar with it or the arguments for it.
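As a harmless illustration of the safety concern, the sketch below uses only the standard library. The class name `Payload` is hypothetical. Its `__reduce__` method tells pickle which callable to invoke when loading; here it is the built-in `sorted`, but a hostile pickle could name any importable callable.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs a callable of its choosing on load."""

    def __reduce__(self):
        # pickle records (callable, args) and calls callable(*args) when
        # the data is loaded. A benign built-in stands in for something worse.
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload: it calls sorted([3, 1, 2]) and
# returns whatever that call produces.
result = pickle.loads(blob)
```

This is why pickles from untrusted sources should not be loaded: the data itself decides what gets called during loading.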

Happy to shift how datreant.data works so long as we can maintain backwards compatibility for existing stores.