CMU-SAFARI / Apollo

Apollo is an assembly polishing algorithm that attempts to correct the errors in an assembly. It can take multiple sets of reads in a single run and polish assemblies of genomes of any size. It is described in the Bioinformatics journal paper (2020) by Firtina et al. at https://people.inf.ethz.ch/omutlu/pub/apollo-technology-independent-genome-assembly-polishing_bioinformatics20.pdf
GNU General Public License v3.0

high memory requirement #6

Closed HeQSun closed 4 years ago

HeQSun commented 4 years ago

Hi,

Thanks for developing this tool. I am using it to polish a 240 Mb assembly with PacBio reads (40x coverage).

Apollo's memory requirement seems to be dynamic. Is there a reason for this? At one point in my run it required more than 256 GB, which is the total memory on my node, so the node was swapping constantly.

I think any reduction in this high memory requirement, or an option to limit memory use, would be good.

thanks, Hequan

canfirtina commented 4 years ago

Hi Hequan,

The high memory requirement is probably because you did not chunk the reads into smaller pieces (please see https://github.com/CMU-SAFARI/Apollo#set-of-reads ). If your read set contains very long reads, Apollo may allocate a large amount of memory just to handle these reads. To prevent this from happening, we suggest chunking the reads into smaller pieces and then aligning the chunked reads to the assembly. We have a very simple script that achieves most of what I just described:

https://github.com/CMU-SAFARI/Apollo/blob/master/utils/chunk_reads.sh
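For readers who want to see the idea in code, here is a minimal Python sketch of read chunking. It is not the chunk_reads.sh script linked above and it is not Apollo's own code. The function names, the "_chunk<i>" naming scheme, and the chunk size used in the demo are all assumptions made for illustration. It assumes FASTA input; the chunked reads would still need to be aligned to the assembly afterwards.

```python
import io
import sys
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (name, sequence) pairs from a FASTA stream."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first token of the header as the read name.
            name, parts = line[1:].split()[0], []
        elif name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def chunk_reads(records: Iterable[Record], chunk_size: int) -> Iterator[Record]:
    """Split each read into consecutive pieces of at most chunk_size bases.

    The "_chunk<i>" suffix is a hypothetical naming scheme, chosen only so
    that every piece keeps a unique name. Empty reads produce no chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for name, seq in records:
        for i, start in enumerate(range(0, len(seq), chunk_size)):
            yield f"{name}_chunk{i}", seq[start:start + chunk_size]


def write_fasta(records: Iterable[Record], handle: TextIO) -> None:
    """Write records as single-line FASTA entries."""
    for name, seq in records:
        handle.write(f">{name}\n{seq}\n")


if __name__ == "__main__":
    # Tiny in-memory demo; the chunk size of 3 is for illustration only.
    demo = io.StringIO(">read1\nACGTACGT\n>read2\nAC\n")
    write_fasta(chunk_reads(read_fasta(demo), chunk_size=3), sys.stdout)
```

Because both functions are generators, only one read is held in memory at a time, which keeps the chunking step itself cheap even for large read sets.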

However, we would like to eliminate this requirement and perform the chunking internally. We will have an update regarding this and some other feature improvements soon, so I will leave this issue open for now and let you know when the update is available.

Thanks,

Can.

HeQSun commented 4 years ago

Hi Can,

Thanks for pointing out the potential problem and providing a way to solve it. I am trying that now.

Yes, I think it would be more convenient for users if Apollo did the chunking by default (or at least issued a warning before continuing when it encounters long reads).

Thank you again!

Best, Hequan

canfirtina commented 4 years ago

Hi Hequan,

You can now use the -c option to perform the chunking at runtime. The default chunk size is 1000, and chunking can be disabled by setting -c to 0. Chunking should greatly reduce the memory requirements without noticeably hurting accuracy. I am closing the issue now, but feel free to re-open it if you observe further issues related to high memory requirements.

Thanks, Can.