BenLangmead / bowtie2

A fast and sensitive gapped read aligner
GNU General Public License v3.0
638 stars 160 forks source link

putative error in CIGAR string #455

Closed pmjklemm closed 5 months ago

pmjklemm commented 6 months ago

bowtie2 v2.5.1

greetings,

set-up: I have unpaired reads as fastq. If I map these to a target.fna, I got a hit that matches (CIGAR) at 5' although the read does not match at that position.

system calls:

bowtie2-build target.fna target.bowtie2.idx
bowtie2 -x target.bowtie2.idx -U tworeads.fastq | grep "NS500339:346:H3W22BGXC:1:11301:7633:19142"

results:

2 reads; of these:
  2 (100.00%) were unpaired; of these:
    1 (50.00%) aligned 0 times
    1 (50.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
50.00% overall alignment rate
NS500339:346:H3W22BGXC:1:11301:7633:19142       0       tRNA-Glu--UUC_3_4_with_60_upstream_region       1       255     3M7I65M *       0       0       *CGC*TTTCTATTAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTT        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEE     AS:i:-41        XN:i:0  XM:i:3  XO:i:1  XG:i:7  NM:i:10 MD:Z:0T0A0T65   YT:Z:UU

CIGAR => 3M7I65M

target.fna:

>tRNA-Glu--UUC_3_4_with_60_upstream_region
**TAT**TAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTTCGTCTAGAGGCCCAGGACACCGCCCTTTCACGGCGGTAACAGGGGTTCGAATCCCCTAGGGGACGCCA

-> Reference starts with TAT not with CGC, which are marked as matches (M)

here are the minimal data to reproduce this (<1KB): data.zip

Maybe these M should be marked as soft-clipping? Or am I doing something wrong?

Another thing that caught my eye: If I try to map only a single read (fastq) then I get nothing.

ch4rr0 commented 6 months ago

Hello,

The "M" in the CIGAR can stand for either a match or a mismatch. We have had users express similar confusions in the past, so we added the --xeq flag in bowtie2 that disambiguates matches (=) from mismatches (X). Using your sample input as an example:

./bowtie2-align-s -x target -U tworeads.fastq --xeq
NS500339:346:H3W22BGXC:1:11301:7633:19142   0   tRNA-Glu--UUC_3_4_with_60_upstream_region   1   0   3X7I65= *   0   0   CGCTTTCTATTAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEE AS:i:-41    XN:i:0  XM:i:3  XO:i:1  XG:i:7  NM:i:10 MD:Z:0T0A0T65   YT:Z:UU

Here the CIGAR shows that there are 3 mismatches followed by 7 insertions and 65 matches. Applying this edit to the reference sequence results in the correct sequence.

CGCTTTCTATTAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTT
          |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TAT-------TAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTTCGTCTAGAGGCCCAGGACACCGCCCTTTCACGGCGGTAACAGGGGTTCGAATCCCCTAGGGG

I hope this helps.

pmjklemm commented 5 months ago

Ahh ok, thank you for clarifying this @ch4rr0! In the future I will always use the --xeq option, this seems very helpful!

Just out of curiosity: Isn't the alignment 10I65= much better or why do the first 3 mismatches get favored over a longer insertion?

ch4rr0 commented 5 months ago

Hello,

Sorry for the delay in answering your question. The aligner will always choose the alignment with the highest alignment score, so the edit in question must have had a lower alignment score than what was chosen. You can refer to the section on The bowtie2 aligner in the manual for more information.

pmjklemm commented 5 months ago

Thank you! I see the alignment with --local is 7S68M which makes way more sense. Otherwise the global default forces an initial match/mismatch, which makes the whole alignment worse as the first 3 missmatches could match later on.

Thank you for your help !