Right now, we are using SortMeRNA to align contigs against the complete ref db, and we output all alignments. This can lead to very big SAM files, because highly conserved contigs will have alignments against almost every ref sequence. In the next sub-step, when reading that SAM file with Python, we load all alignments of a given contig into memory at once, which can lead to huge memory usage.
We can imagine several complementary solutions to reduce this RAM and disk space usage:
[ ] optimise batch SAM reading in Python by storing only the relevant fields of each alignment in memory
[ ] store alignments in a BAM file instead of a SAM file. Then we'll probably need a combination of samtools and Python libraries to read it properly.
[ ] rethink our scaffolding strategy so that we don't have to output all possible alignments. This would likely reduce memory usage the most, but it changes the algorithm and needs to be thought through well in advance.
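For the first option, a minimal sketch of what contig-batched reading could look like. It assumes (to be verified) that SortMeRNA writes all alignments of a given contig consecutively in the SAM output, so we can group by QNAME while streaming, and it keeps only three fields per alignment instead of the full line. The function name and the choice of fields are illustrative, not part of our current code:

```python
import itertools

def iter_contig_alignments(sam_path):
    """Yield (contig_id, alignments) one contig at a time instead of
    loading the whole SAM file; each alignment keeps only the fields
    we need here (reference name, position, CIGAR), not the full line."""
    with open(sam_path) as sam:
        records = (
            line.rstrip("\n").split("\t")
            for line in sam
            if not line.startswith("@")  # skip SAM header lines
        )
        # Assumes all alignments of a contig are consecutive in the file,
        # so grouping by QNAME (column 1) streams one contig at a time.
        for contig_id, group in itertools.groupby(records, key=lambda f: f[0]):
            # Keep RNAME, POS, CIGAR only; adjust to whatever the
            # scaffolding step actually needs.
            yield contig_id, [(f[2], int(f[3]), f[5]) for f in group]
```

Peak memory then scales with the number of alignments of the most conserved single contig, not with the whole file, and the per-alignment footprint is a small tuple rather than a full SAM line.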