FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

MAPQ Score for perfectly aligned sequence #482

Closed aayushraman closed 2 years ago

aayushraman commented 2 years ago

Hi Felix,

I hope you are doing well. I am using the bismark v0.22.3. I have a question regarding the MAPQ score for the perfectly mapped reads. The MPAQ for one read is 40, and the other is 0, even though both have a high Phred score and have an exact mapping cigar sequence of 295M. Can you please elaborate on why this is the reason? Thanks. Appreciate it.

Here are the fastq reads:

@FFSP241:66:BPN80002-2834:1:1103:4490:2770 1:N:0:5
CAAGGCGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAACGAATGTAAAGAGAGGTTAGCGTTGGGGTGGGGTTGTCGGGGTAAAGGGATAGGTTCGGGGGAGGTTCGGGTTGTTACGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGCGTGTCGGTAGGGCGCGGTGGTTTACGTTTGTAATTTTAGCATTTTGGAAGGTCGAGGCGGGCGGATGATTTCAGGTCGGGAGTTTTAGATTAGTTTGATTAAT
+
:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F::

@FFSP241:66:BPN80002-2834:1:1114:8200:1450 1:N:0:5
TAAGGCGGGTAATTTTTAAAGTTAGGTGGAAACGTTCGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTCGGGGGAGGTTCGGGTCGTTACGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGCGTGTCGGTAGGGCATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGCCGGGCGGATGATTTAAGCTTGGGAGTTTTAAATTAATTTAATTAAT
+
FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,:

Here is the reference reads is:

>ref NNNNNNNNNNCTGGAGAGGAAGGCAACTGAGGGCTGCGGGAGGTGAGGAGGAGGCGTGGACGCCAGGAGTTGTTTTGCTCACACGCCCCGCGGCGTCCCGCCTAGGTGAGCTCCGGACCCTCCCGCCCCACCCCAGCGGGGCTGCTCGACTCCGTCGTGCCCCCTTGGGCTTCCACTGCAGGAATGCAAGGCGGGTAACCTTCAAAGCCAGGTGGAAACGTCCGGTTTGAACCGCACATATGCTAGGGTGGAGAGTGAGAACGAATGCAAAGAGAGGCTAGCGTTGGGGTGGGGCTGCCGGGGCAAAGGGACAGGCCCGGGGGAGGCCCGGGCCGCCACGGTTAGAAAGCATGGAAAGAGCTTAAGTAAAATAGCTGCGTGTCGGCAGGGCGCGGTGGCTCACGCCTGTAATCTCAGCACTTTGGGAGGCCGAGGCGGGCGGATGACCTGAGGTCGGGAGTTCCAGACCAGCCTGACCAACATGGAGNNNNNNNNNN

and here is the aligned sequence:

FFSP241:66:BPN80002-2834:1:1103:4490:2770_1:N:0:5   0   ref 187 40  295M    *   0   0   CAAGGCGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAACGAATGTAAAGAGAGGTTAGCGTTGGGGTGGGGTTGTCGGGGTAAAGGGATAGGTTCGGGGGAGGTTCGGGTTGTTACGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGCGTGTCGGTAGGGCGCGGTGGTTTACGTTTGTAATTTTAGCATTTTGGAAGGTCGAGGCGGGCGGATGATTTCAGGTCGGGAGTTTTAGATTAGTTTGATTAAT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F:: NM:i:52 MD:Z:12C0C2C4C0C9C2C0C8C0C1C1C5C24C9C16C2C5C7C3C0C9C0C4C0C1C0C12C10C13C10C12C1C3C0C6C1C4C5G3C16C0C1G12C0C3C0C2C0C3C0C2C0    XM:Z:H....Z......hh..h....hx.........z..xz........xz.h.h.....h..................Z.....h.........h...Z............x..xZ....h.......x...hxZ........hxZ...xz.hh.Z..........h..........h.............x..Z....Z..x....Z.Z.....h.h.Z.hx......h.x..H.h.........xZ....Z...Z......hx......Z.......hx...hx..hx...hh..h    XR:Z:CT XG:Z:CT XA:Z:2

FFSP241:66:BPN80002-2834:1:1114:8200:1450_1:N:0:5   0   ref 187 0   295M    *   0   0   TAAGGCGGGTAATTTTTAAAGTTAGGTGGAAACGTTCGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTCGGGGGAGGTTCGGGTCGTTACGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGCGTGTCGGTAGGGCATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGCCGGGCGGATGATTTAAGCTTGGGAGTTTTAAATTAATTTAATTAAT FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,: NM:i:64 MD:Z:0C11C0C2C4C0C12C9C0C1C1C5C18C5C9C3C12C2C0C4C7C3C0C9C0C4C2C0C12C10C13C10C5G0C5C1C1C1C0C6C1C2C1C5G3C0C3G11C0C1G2G1C7C0C1G1C0C1G0C0C1G1C0C2C0 XM:Z:h....Z......hh..h....hx.........Z..xZ........xz.h.h.....h..................z.....h.........h...z............x..xz....h.......x...hxZ........hxZ...xZ.hh.Z..........h..........h.............x..Z....Z..x....Z.z.....h.h.z.hx......h.x..h.h.........xz....Z...Z......hx......z.......hx...hx..hx...hh..h    XR:Z:CT XG:Z:CT XA:Z:8

Best, -Ar

FelixKrueger commented 2 years ago

The reason for this is that the first sequence has 2 (non-bisulfite) mismatches, while the second sequence has 8 mismatches (see AS:i:-12 vs AS:i:-48). Here is the original Bowtie2 output for the top strand reference only:

@PG ID:bowtie2  PN:bowtie2  VN:2.4.5    CL:"/bi/apps/bowtie2/2.4.5/bowtie2-align-s --wrapper basic-0 -q --score-min L,0,-0.2 --ignore-quals --norc -x reference/Bisulfite_Genome/CT_conversion/BS_CT test.fastq_C_to_T.fastq"
FFSP241:66:BPN80002-2834:1:1103:4490:2770_1:N:0:5   0   ref_CT_converted    187 40  295M    *   0   0   TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTGTGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGGTGGGTGGATGATTTTAGGTTGGGAGTTTTAGATTAGTTTGATTAAT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F:: AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:239G23G31  YT:Z:UU
FFSP241:66:BPN80002-2834:1:1114:8200:1450_1:N:0:5   0   ref_CT_converted    187 0   295M    *   0   0   TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGTTGGGTGGATGATTTAAGTTTGGGAGTTTTAAATTAATTTAATTAAT FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,: AS:i:-48    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:205G33G8G14G2G12G4G3G6YT:Z:UU

Since the MAPQ score is a mix between uniqueness of the sequence, as well as the number of mismatches (and/or indels), as a function of the --score-min function, this would be the reason. We personally do not filter for MAPQ scores, as the very strict scores employed by Bismark by default (--score_min L,0,-0.2) aggravates this situation. As an example, if you relax the mapping criteria, the MAPQ score changes markedly:

--score_min L,0,-0.4:

@PG ID:bowtie2  PN:bowtie2  VN:2.4.5    CL:"/bi/apps/bowtie2/2.4.5/bowtie2-align-s --wrapper basic-0 -q --score-min L,0,-0.4 --ignore-quals --norc -x reference/Bisulfite_Genome/CT_conversion/BS_CT test.fastq_C_to_T.fastq"
FFSP241:66:BPN80002-2834:1:1103:4490:2770_1:N:0:5   0   ref_CT_converted    187 42  295M    *   0   0   TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTGTGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGGTGGGTGGATGATTTTAGGTTGGGAGTTTTAGATTAGTTTGATTAAT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F:: AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:239G23G31  YT:Z:UU
FFSP241:66:BPN80002-2834:1:1114:8200:1450_1:N:0:5   0   ref_CT_converted    187 23  295M    *   0   0   TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGTTGGGTGGATGATTTAAGTTTGGGAGTTTTAAATTAATTTAATTAAT FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,: AS:i:-48    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:205G33G8G14G2G12G4G3G6YT:Z:UU

--score_min L,0,-0.6:

@PG ID:bowtie2  PN:bowtie2  VN:2.4.5    CL:"/bi/apps/bowtie2/2.4.5/bowtie2-align-s --wrapper basic-0 -q --score-min L,0,-0.6 --ignore-quals --norc -x reference/Bisulfite_Genome/CT_conversion/BS_CT test.fastq_C_to_T.fastq"
FFSP241:66:BPN80002-2834:1:1103:4490:2770_1:N:0:5   0   ref_CT_converted    187 42  295M    *   0   0   TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTGTGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGGTGGGTGGATGATTTTAGGTTGGGAGTTTTAGATTAGTTTGATTAAT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F:: AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:239G23G31  YT:Z:UU
FFSP241:66:BPN80002-2834:1:1114:8200:1450_1:N:0:5   0   ref_CT_converted    187 40  295M    *   0   0   TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGTTGGGTGGATGATTTAAGTTTGGGAGTTTTAAATTAATTTAATTAAT FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,: AS:i:-48    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:205G33G8G14G2G12G4G3G6YT:Z:UU

I hope this resolves it?

aayushraman commented 2 years ago

Hi Felix,

I have two questions:

  1. If there are 2 and 8 mismatches, then why does my CIGAR sequence suggest a perfect match of 295M?
  2. What ideal score_min argument should I use for my analysis? and why? I tried a couple of different arguments and could not come up with an "ideal" scoring argument.

Thanks, -Ar

FelixKrueger commented 2 years ago

1) The CIGAR string doesn't say that match is perfect, it says the match is linear, i.e. there are no insertions or deletions. You need to look at the MD:Z: string to see mismatches, the alignment score (AS:i:0 would be a perfect match), and maybe the hamming distance (NM I believe) if you want more detailed informtion.

2) I am not sure if there is a generally accepted definition of "ideal". I personally tend to use Bismark with its stringent default scoring but without any subsequent MAPQ filtering (assuming that the mapping takes care of worst multi-mapping reads alrteady).

aayushraman commented 2 years ago

Thanks, @FelixKrueger. Appreciate it.

aayushraman commented 2 years ago

Hi @FelixKrueger,

I went through the MD:Z string, and I am not able to understand what does the following string means: MD:Z:0C11C0C2C4C0C12C9C0C1C1C5C18C5C9C3C12C2C0C4C7C3C0C9C0C4C2C0C12C10C13C10C5G0C5C1C1C1C0C6C1C2C1C5G3C0C3G11C0C1G2G1C7C0C1G1C0C1G0C0C1G1C0C2C0 It is the 13th column of the aligned bam file. Thanks!

FelixKrueger commented 2 years ago

This is an auxiliary flag, you can read up on it in the SAM specification if you are interested (https://samtools.github.io/hts-specs/SAMtags.pdf).

More specifically:

MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
String encoding mismatched and deleted reference bases, used in conjunction with the CIGAR and
SEQ fields to reconstruct the bases of the reference sequence interval to which the alignment has been
mapped. This can enable variant calling without requiring access to the entire original reference.
The MD string consists of the following items, concatenated without additional delimiter characters:
• [0-9]+, indicating a run of reference bases that are identical to the corresponding SEQ bases;
• [A-Z], identifying a single reference base that differs from the SEQ base aligned at that position;
• \^[A-Z]+, identifying a run of reference bases that have been deleted in the alignment.
As shown in the complete regular expression above, numbers alternate with the other items. Thus if two
mismatches or deletions are adjacent without a run of identical bases between them, a ‘0’ (indicating
a 0-length run) must be used to separate them in the MD string.
Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations) are not
represented in the MD string. When reconstructing the reference sequence, inserted and soft-clipped
SEQ bases are omitted as determined by tracking ‘I’ and ‘S’ operations in the CIGAR string. (If the
CIGAR string contains ‘N’ operations, then the corresponding skipped parts of the reference sequence
cannot be reconstructed.)
For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are
10 matches followed by an A on the reference which is different from the aligned read base; the next 5
reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC;
the last 6 bases are matches