BenLangmead / bowtie2

A fast and sensitive gapped read aligner
GNU General Public License v3.0

bowtie2 core dumps on references with long homopolymer runs #18

Closed jorvis closed 9 years ago

jorvis commented 9 years ago

I'm aligning (in batches) around 20TB of read data against several thousand microbial genomes. Some of these batches fail with core dumps after a very long runtime (around 10x as long as the successful ones). I've tried looking into why only certain batches fail, and what I've found is that the genomes it fails on are those that contain long (likely incorrect) homopolymeric repeats. One example is:

gi|257136525|ref|NZ_GG699286.1| Xanthomonas campestris pv. vasculorum NCPPB702 genomic scaffold scf_7293_715, whole genome shotgun sequence

Examples of the homopolymeric stretches:

WARNING: Sequence ID gi|257136525|ref|NZ_GG699286.1| contains a homopolymer run (T) of length 45972
WARNING: Sequence ID gi|257136529|ref|NZ_GG699290.1| contains a homopolymer run (A) of length 131072
WARNING: Sequence ID gi|257136550|ref|NZ_GG699311.1| contains a homopolymer run (T) of length 51385
WARNING: Sequence ID gi|257136550|ref|NZ_GG699311.1| contains a homopolymer run (A) of length 262144
WARNING: Sequence ID gi|257136567|ref|NZ_GG699328.1| contains a homopolymer run (A) of length 61064

Obviously these are incorrect sequences, but many entries like this still appear in public databases and cause bowtie2 to fail. When I replace them with Ns, bowtie2 runs to completion. Is this a known issue with bowtie2?

(I'm using bowtie2-2.2.4)
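
For anyone wanting to screen references for this problem before aligning, here is a minimal sketch of a homopolymer scan. It is not the tool that produced the warnings above; the function name `homopolymer_runs` and the `min_len` default are hypothetical choices made for illustration.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=10):
    """Yield (base, start, length) for each run of a single base
    that is at least min_len long. Runs of N are skipped.

    min_len=10 is an illustrative default, not a bowtie2 setting.
    """
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if base != "N" and length >= min_len:
            yield base, pos, length
        pos += length


if __name__ == "__main__":
    example = "ACG" + "T" * 12 + "GA"
    for base, start, length in homopolymer_runs(example):
        print(f"homopolymer run ({base}) of length {length} at offset {start}")
```

The scan is a single pass over the sequence, so it should be cheap to run on each reference record before building an index.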

val-antonescu commented 9 years ago

We prefer this behavior for various reasons. Bowtie2 will stop in this case, usually with a SIGABRT triggered by a bad alloc. As you already know, masking the repeats before searching those sequences is the way to go.

jorvis commented 9 years ago

Can you recommend the minimum homopolymer repeat length which should be masked based on the algorithm?

val-antonescu commented 9 years ago

My first thought would be something like 10 bp. That should keep bowtie2 from wandering around too much with a seed length of 20 bp.
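
As a rough illustration of the masking discussed in this thread, the sketch below replaces every homopolymer run of at least `min_len` bases with Ns. The function name `mask_homopolymers` is hypothetical, and the default of 10 simply mirrors the first-guess threshold suggested above; it is not a value taken from bowtie2 itself.

```python
import re


def mask_homopolymers(seq, min_len=10):
    """Replace each run of one base (A, C, G or T) that is at least
    min_len long with the same number of N characters.

    min_len=10 follows the first-guess threshold from the thread;
    treat it as an assumption to tune, not a bowtie2 constant.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by min_len - 1 or more repeats of that same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1), re.IGNORECASE)
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


if __name__ == "__main__":
    print(mask_homopolymers("ACG" + "T" * 12 + "GA"))
```

Because each run is replaced by an equal number of Ns, the sequence length and the coordinates of everything outside the masked runs stay the same.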

BenLangmead commented 9 years ago

Can you provide exact parameters used for the runs that fail?

val-antonescu commented 9 years ago

I will close this issue for now.

Val