GATB / bcalm

compacted de Bruijn graph construction in low memory
MIT License

Optimal settings for generating unitigs #39

Closed rsuchecki closed 5 years ago

rsuchecki commented 5 years ago

This is not an issue, just a question: what would be the optimal settings for fast generation of unitigs? I expect running on an SSD or, even better, a RAM drive should help. What about the other options? Increasing -max-memory to reduce disk use seems to be a no-brainer; what else could help?

Given that minia uses bcalm, I guess it makes sense to use bcalm directly for this purpose, or could there be any advantage in using minia for fast generation of unitigs?

rchikhi commented 5 years ago

Dear Rad,

I'd indeed recommend using a ramdisk, or at least an SSD. We also find that some parallel network filesystems can be fast too (check the read/write speed yourself, perhaps using dd or hdparm).
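If dd or hdparm are not at hand, a rough sequential-write check can also be scripted. The following is only a minimal sketch in Python: the function name `write_throughput_mb_s` and its default sizes are made up for illustration, it measures writes only, and small test sizes mostly measure caching rather than the device itself.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, block_mb=4):
    """Write size_mb of random data into `directory` and return MB/s.

    Hypothetical helper for comparing candidate working directories
    (ramdisk, SSD, network filesystem). Not part of BCALM.
    """
    block = os.urandom(block_mb * 1024 * 1024)
    n_blocks = size_mb // block_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            # Force the data out of Python and OS buffers before timing stops.
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return (n_blocks * block_mb) / elapsed


if __name__ == "__main__":
    # Compare a few candidate directories; adjust the paths to your system.
    for candidate in [tempfile.gettempdir()]:
        print(candidate, round(write_throughput_mb_s(candidate), 1), "MB/s")
```

Running it against each candidate directory gives a quick relative comparison; for absolute numbers, a dedicated tool remains the better choice.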

The second recommendation is to look at the k-mer histogram (given by any k-mer counter) to find your optimal cutoff threshold (-abundance-min) and get rid of most erroneous k-mers, since more distinct k-mers means a longer unitig generation time.
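One common way to read a cutoff off such a histogram is to take the first valley, i.e. the abundance where the count of distinct k-mers stops falling after the initial peak of erroneous k-mers. The sketch below assumes the histogram has already been loaded into a dict mapping abundance to number of distinct k-mers; the function name and the example numbers are invented for illustration, and this is a generic heuristic rather than anything BCALM does internally.

```python
def first_valley_cutoff(histogram):
    """Return the abundance at the first local minimum of the histogram.

    histogram: dict mapping abundance (int) -> number of distinct k-mers.
    Returns None if the counts never rise again (no valley found).
    Hypothetical helper; a heuristic, so check the result against a plot.
    """
    abundances = sorted(histogram)
    for prev, cur in zip(abundances, abundances[1:]):
        # The first time the count increases, `prev` was the valley bottom.
        if histogram[cur] > histogram[prev]:
            return prev
    return None


if __name__ == "__main__":
    # Made-up histogram: error peak at abundance 1, genomic peak near 6.
    example = {1: 1000, 2: 200, 3: 50, 4: 40, 5: 80, 6: 150, 7: 120}
    print(first_valley_cutoff(example))  # prints 4
```

The value returned is only a starting point for the abundance cutoff; noisy or low-coverage histograms may have no clear valley, in which case inspecting the plot by eye is safer.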

The max-memory parameter actually has less influence on disk usage than one would think. As far as I recall, it's almost exclusively influencing the number of passes in the kmer counting step.

Minia won't be faster than BCALM (nor vice-versa), since, as you guessed, Minia actually calls BCALM as a submodule.

Thanks for your interest,

Rayan

rsuchecki commented 5 years ago

Thank you for clarifying all that, and for all the great tools, Rayan!

Rad