cerebis / bin3C

Extract metagenome-assembled genomes (MAGs) from metagenomic data using Hi-C.
GNU Affero General Public License v3.0

Switch to HDF5 based storage of intermediate data types. #34

Open cerebis opened 3 years ago

cerebis commented 3 years ago

Currently, intermediate data is stored simply by compressing pickled Python class instances.

This approach was chosen over other serialisation methods as a good-enough and quick solution. However, as time passes and the codebase evolves, the dependency of existing serialised instances on a particular class version becomes increasingly problematic. It can prevent users from going back to old data and reanalysing it with a newer version of the software, since the class can no longer be deserialised.
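To illustrate the failure mode, here is a minimal sketch using only the standard library. The module name `old_layout` and the `ContactMap` class are hypothetical stand-ins, not the actual bin3C classes. A pickle records only the module and class name, so the data becomes unreadable once that name no longer resolves:

```python
import pickle
import sys
import types

# Hypothetical stand-in for a class shipped in an older release.
# "old_layout" and "ContactMap" are illustrative names only.
old_module = types.ModuleType("old_layout")
exec(
    "class ContactMap:\n"
    "    def __init__(self, counts):\n"
    "        self.counts = counts\n",
    old_module.__dict__,
)
sys.modules["old_layout"] = old_module

# Serialise an instance, as an older version of the software would have done.
blob = pickle.dumps(old_module.ContactMap([[0, 3], [3, 0]]))

# Simulate a newer release in which the class has been moved or renamed.
del sys.modules["old_layout"]

try:
    pickle.loads(blob)
    outcome = "loaded"
except (ModuleNotFoundError, AttributeError) as err:
    # The pickle references old_layout.ContactMap by name, which is gone.
    outcome = type(err).__name__

print(outcome)
```

The same problem arises in subtler forms when a class keeps its name but its attributes change, since the restored instance then carries the old attribute layout.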

Either we must provide conversions between class versions or, better, avoid the problem entirely.

Therefore, bin3C should switch to a class-agnostic and efficient means of storing intermediate analysis results (contact map, clusterings). Although we could pickle plain datatypes, an obvious candidate is HDF5, though it would introduce a chunk of dependencies of its own. Another alternative is to adopt an existing Hi-C HDF5 format, so long as it does not itself include external class implementation details or extraneous fields not relevant to metagenomics.
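As a rough sketch of what class-agnostic storage could look like, the following uses `h5py` and `numpy` to write a sparse contact map as plain coordinate arrays. The group layout, the dataset names (`row`, `col`, `count`, `seq_names`), the `format_version` attribute and both function names are assumptions made for illustration, not an agreed bin3C format:

```python
import os
import tempfile

import h5py
import numpy as np


def save_contact_map(path, row, col, count, seq_names):
    """Write a sparse contact map as plain COO arrays (hypothetical layout)."""
    with h5py.File(path, "w") as h5:
        # Hypothetical version tag, so future readers can detect layout changes.
        h5.attrs["format_version"] = 1
        grp = h5.create_group("contact_map")
        grp.create_dataset("row", data=np.asarray(row, dtype=np.int64), compression="gzip")
        grp.create_dataset("col", data=np.asarray(col, dtype=np.int64), compression="gzip")
        grp.create_dataset("count", data=np.asarray(count, dtype=np.int64), compression="gzip")
        # Fixed-width byte strings; assumes ASCII sequence names.
        grp.create_dataset("seq_names", data=np.asarray(seq_names, dtype="S"))


def load_contact_map(path):
    """Read the arrays back without needing any bin3C class definitions."""
    with h5py.File(path, "r") as h5:
        grp = h5["contact_map"]
        names = [name.decode("utf-8") for name in grp["seq_names"][()]]
        return grp["row"][()], grp["col"][()], grp["count"][()], names


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "contact_map.h5")
    save_contact_map(
        path,
        row=[0, 0, 1],
        col=[1, 2, 2],
        count=[5, 2, 7],
        seq_names=["contig_a", "contig_b", "contig_c"],
    )
    row, col, count, names = load_contact_map(path)
```

Because the file holds only arrays and attributes, any later version of the software (or another tool entirely) can read it, and layout changes can be handled by checking the version attribute rather than by reconstructing old classes.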