Closed pmjklemm closed 5 months ago
Hello,
The "M" in the CIGAR can stand for either a match or a mismatch. We have had users express similar confusions in the past, so we added the --xeq flag in bowtie2 that disambiguates matches (=) from mismatches (X). Using your sample input as an example:
./bowtie2-align-s -x target -U tworeads.fastq --xeq
NS500339:346:H3W22BGXC:1:11301:7633:19142 0 tRNA-Glu--UUC_3_4_with_60_upstream_region 1 0 3X7I65= * 0 0 CGCTTTCTATTAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEE AS:i:-41 XN:i:0 XM:i:3 XO:i:1 XG:i:7 NM:i:10 MD:Z:0T0A0T65 YT:Z:UU
Here the CIGAR shows that there are 3 mismatches followed by 7 insertions and 65 matches. Applying this edit to the reference sequence results in the correct sequence.
CGCTTTCTATTAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTT
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TAT-------TAATGAAAGCTCACCCTACACGAAAATATCACGCAACGCGTGATAAGCAATTTTCGTGTCCCCTTCGTCTAGAGGCCCAGGACACCGCCCTTTCACGGCGGTAACAGGGGTTCGAATCCCCTAGGGG
I hope this helps.
Ahh ok, thank you for clarifying this @ch4rr0! In the future I will always use the --xeq option, this seems very helpful!
Just out of curiosity:
Isn't the alignment 10I65=
much better or why do the first 3 mismatches get favored over a longer insertion?
Hello,
Sorry for the delay in answering your question. The aligner will always choose the alignment with the highest alignment score, so the edit in question must have had a lower alignment score than what was chosen. You can refer to the section on The bowtie2 aligner in the manual for more information.
Thank you! I see the alignment with --local
is 7S68M
which makes way more sense.
Otherwise the global default forces an initial match/mismatch, which makes the whole alignment worse as the first 3 missmatches could match later on.
Thank you for your help !
bowtie2 v2.5.1
greetings,
set-up: I have unpaired reads as fastq. If I map these to a target.fna, I got a hit that matches (CIGAR) at 5' although the read does not match at that position.
system calls:
results:
CIGAR => 3M7I65M
target.fna:
-> Reference starts with TAT not with CGC, which are marked as matches (M)
here are the minimal data to reproduce this (<1KB): data.zip
Maybe these M should be marked as soft-clipping? Or am I doing something wrong?
Another thing that caught my eye: If I try to map only a single read (fastq) then I get nothing.