broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0

Expected memory usage #125

Closed (nhartwic closed 3 years ago)

nhartwic commented 4 years ago

Can anyone state the algorithmic memory usage of Pilon? What factors influence memory usage and to what degree? For the sake of argument, assume K coverage and that reads map uniformly and uniquely. Does reference size matter? Does reference contiguity matter? Does number of threads matter? How does memory usage scale with read length and depth?

Is this information published anywhere?

fergsc commented 3 years ago

Echoing this request. I am currently trying to polish a 500 Mbp genome with 30x coverage and have run out of memory on a 3000 GB HPC node.

w1bw commented 3 years ago

Memory usage is mostly driven by the size of the genome, though it also depends on whether Pilon is attempting fixes that require local reassembly (the default). When reassembly is on, it must keep track of all read pairs whose mates aren't aligned near one another in order, which it calls "strays", and that can create a very large in-memory data structure proportional to coverage; how large depends on how well the alignments match. If you are only doing base polishing (i.e., --fix bases), the memory requirements are far lower.
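
For anyone who wants to try the base-only mode described above, here is a rough sketch of how the command line could be put together from a script. The file names, the output prefix, and the 64 GB heap size are placeholders I made up for illustration, not recommendations; the helper `build_pilon_command` is hypothetical and only assembles the argument list, it does not run Pilon.

```python
import shlex


def build_pilon_command(genome, frags_bam, heap_gb,
                        fix="bases", jar="pilon.jar", output="polished"):
    """Assemble a Pilon command line restricted to one --fix mode.

    All paths and the heap size are caller-supplied placeholders.
    """
    return [
        "java", f"-Xmx{heap_gb}G", "-jar", jar,  # JVM heap limit for the run
        "--genome", genome,                      # assembly to polish (FASTA)
        "--frags", frags_bam,                    # sorted, indexed paired-end BAM
        "--fix", fix,                            # "bases" skips local reassembly
        "--output", output,                      # prefix for output files
    ]


# Hypothetical inputs; substitute your own files and heap size.
cmd = build_pilon_command("assembly.fasta", "reads.sorted.bam", heap_gb=64)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell or handed to `subprocess.run(cmd)`. Whether a given heap size is enough will still depend on genome size and on how well the reads align, as noted above.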

nhartwic commented 3 years ago

Thanks for the information. Makes sense.

fergsc commented 3 years ago

Memory usage on the order of 3000+ GB seems excessive.