Reads aligning at ends of reference sequences don't pass guards

barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.

GNU General Public License v2.0

If a read aligns only partially because it runs off the end of a reference fragment, it won't pass the guard that requires 90% of its length to be mapped for it to be counted. These reads should get to count the part that extends past the end of the fragment as being "mapped" for purposes of this test! This problem can be seen in many tests that use a linear reference sequence, but it is especially bad when mapping to contigs from a de novo assembly.
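To make the proposed behavior concrete, here is a minimal sketch in Python. It is not breseq's actual code. The function names, the 1-based inclusive coordinates, and the assumption that read coordinates are given in the orientation of the alignment are all illustrative. Only the 90% threshold comes from the description above.

```python
def mapped_fraction(read_length, query_start, query_end,
                    ref_start, ref_end, ref_length,
                    count_overhang=True):
    """Fraction of a read treated as mapped (hypothetical helper).

    All coordinates are 1-based and inclusive. query_* are positions in
    the read, ref_* are positions in the reference fragment.
    """
    mapped = query_end - query_start + 1
    if count_overhang:
        # Unaligned bases on each side of the read.
        left_clip = query_start - 1
        right_clip = read_length - query_end
        # Reference bases available on each side of the alignment.
        left_room = ref_start - 1
        right_room = ref_length - ref_end
        # Clipped bases that have no reference left to align to
        # hang off the fragment, so count them as mapped.
        mapped += max(0, left_clip - left_room)
        mapped += max(0, right_clip - right_room)
    return mapped / read_length


def passes_guard(fraction, min_fraction=0.9):
    """Apply the 90%-of-read-length requirement."""
    return fraction >= min_fraction


# A 100-base read whose first 60 bases align to the last 60 bases
# of a 1000-base contig; the remaining 40 bases hang off the end.
strict = mapped_fraction(100, 1, 60, 941, 1000, 1000, count_overhang=False)
lenient = mapped_fraction(100, 1, 60, 941, 1000, 1000)
print(strict, passes_guard(strict))    # 0.6 False
print(lenient, passes_guard(lenient))  # 1.0 True
```

A read with the same partial alignment in the middle of the contig still fails under both settings, so the change only affects reads that actually run off the end of a fragment.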

Hi @jeffreybarrick,

Is there any update on this issue? I've been using breseq for a project where all the references are fragmented, de novo assembled genomes, and I think some odd results I'm seeing are due to it. Many of the called mutations look like the attached image, where the variants appear only on reads that carry other variants. When I search for such a read in the reference sequence, I often get an exact match to either the beginning or the end of a different copy of the gene than the one in which breseq called the mutation.

I can remove these sites in post-processing, but I wanted to see if there's a way to account for this during the alignment step. Unfortunately, we're working with a lot of taxa, and there are too many gaps to close them all through Sanger sequencing.

Best, Will

Reads aligning at ends of reference sequences don't pass guards #27