Sequences with lots of Ns make things run 10x slower

ctSkennerton / minced

Mining CRISPRs in Environmental Datasets

GNU General Public License v3.0

Sequences with lots of Ns make things run 10x slower #8

Closed. tseemann closed this issue 9 years ago

tseemann commented 9 years ago

I've got this report for prokka which sounds like a minced bug: https://github.com/tseemann/prokka/issues/116

I'm guessing it finds a lot of repeats in those poly-N runs!

Need to mask long poly-runs of any base?

From the linked report: I have a sequence around 100 kbp in length, but buffered at both ends with Ns, so the total length of the sequence is 2.8 Mbp. Prokka gets stuck at "searching for CRISPR repeats" and, though it still finishes, takes >10x as long as annotating a 2.8 Mbp sequence with no Ns.
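The masking idea raised above could be approximated outside minced by cutting the input at long N runs before the CRISPR search. The following is only a rough sketch under stated assumptions, not how minced handles this: the function name `split_on_n_runs` and the `min_run` threshold of 50 are hypothetical, and the returned offsets would still need to be added back to any coordinates reported for a segment.

```python
import re
from typing import Iterator, Tuple


def split_on_n_runs(seq: str, min_run: int = 50) -> Iterator[Tuple[int, str]]:
    """Yield (offset, segment) pairs for the stretches between long N runs.

    Hypothetical helper: runs of N shorter than min_run are left in place,
    so short ambiguous stretches inside real sequence are not cut.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    start = 0
    for match in pattern.finditer(seq):
        # Emit the sequence before this N run, if there is any.
        if match.start() > start:
            yield start, seq[start:match.start()]
        start = match.end()
    # Emit whatever follows the last long N run.
    if start < len(seq):
        yield start, seq[start:]


# A short sequence padded with Ns at both ends, as in the report.
padded = "N" * 1000 + "ACGTACGTNNACGT" + "N" * 1000
segments = list(split_on_n_runs(padded, min_run=50))
print(segments)  # [(1000, 'ACGTACGTNNACGT')]
```

Each segment could then be searched on its own, with the offset used to translate hits back to positions in the padded sequence.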

ctSkennerton commented 9 years ago

Yes, the original code was designed to work on completed genomes, where long runs of Ns aren't a problem. I'll look into a fix for this.

tseemann commented 9 years ago

ping

ctSkennerton commented 9 years ago

Should hopefully be fixed with new version 0.2.0