Integrate compression of FASTA sequences

Currently, VizBin stores the FASTA sequences in memory. This typically works well for current metagenomic assemblies based on short reads (e.g., Illumina). However, third-generation sequencing (e.g., PacBio, ONT) may yield sequences that are both numerous and long. This is presumably not a problem for the analysis itself once the composition of the sequences (k-mer profile) has been computed, but keeping numerous, long sequences in memory for later export of the selected bins is problematic.

Hence, integrating compression of the FASTA sequences is encouraged. In theory, current short-read assembly results should also benefit through a reduced memory footprint. This in turn would either lower memory usage or allow larger datasets to be processed with restricted resources. In that sense, the bottleneck is not the CLR transformation, dimension reduction, etc., but rather the import and temporary storage of the input sequences.
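
To make the idea concrete, here is a minimal sketch of keeping sequences compressed in memory and decompressing them only on export. It is written in Python with the standard `zlib` module purely for illustration. The class name `CompressedSequenceStore` and its methods are hypothetical and not part of VizBin, and zlib is just one possible choice of codec.

```python
import zlib


class CompressedSequenceStore:
    """Hypothetical store: keeps each FASTA record zlib-compressed in memory."""

    def __init__(self, level=6):
        self._level = level   # zlib compression level (0-9)
        self._blobs = {}      # header -> compressed sequence bytes

    def add(self, header, sequence):
        # Compress at import time so the plain sequence need not stay in memory.
        self._blobs[header] = zlib.compress(sequence.encode("ascii"), self._level)

    def get(self, header):
        # Decompress a single record on demand.
        return zlib.decompress(self._blobs[header]).decode("ascii")

    def compressed_size(self):
        # Total number of bytes held for all compressed sequences.
        return sum(len(blob) for blob in self._blobs.values())

    def export_fasta(self, headers, width=60):
        # Write only the selected records (e.g., one bin), wrapped at `width`.
        lines = []
        for header in headers:
            seq = self.get(header)
            lines.append(">" + header)
            lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
        return "\n".join(lines) + "\n"
```

With this layout, only the records of the bin being exported are decompressed at any one time. The actual saving depends on the data and would need to be measured on real assemblies.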

Which compression algorithm to use remains to be seen!
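
As one candidate to evaluate, nucleotide sequences can be packed at 2 bits per base instead of one byte per character. The sketch below is an illustration only, not a recommendation: the function names are hypothetical, and it assumes sequences contain only uppercase A, C, G, and T. Ambiguity codes such as N or lowercase (soft-masked) bases would need separate handling.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_2bit(seq):
    """Pack an A/C/G/T-only sequence into 2 bits per base.

    Returns (length, packed_bytes); the length is needed to ignore
    padding bits in the last byte. Raises KeyError on any other character.
    """
    out = bytearray((len(seq) + 3) // 4)  # four bases per byte, rounded up
    for i, base in enumerate(seq):
        out[i // 4] |= _CODE[base] << (2 * (i % 4))
    return len(seq), bytes(out)


def unpack_2bit(length, packed):
    """Inverse of pack_2bit: recover the original sequence string."""
    return "".join(
        _BASE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

This gives a fixed fourfold reduction relative to one byte per base, whereas a general-purpose codec such as zlib handles arbitrary characters but has a data-dependent ratio. Comparing the two on real datasets would help settle the choice.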

claczny / VizBin

Integrate compression of FASTA sequences #21