Closed aayushraman closed 2 years ago
The reason for this is that the first sequence has 2 (non-bisulfite) mismatches, while the second sequence has 8 mismatches (see AS:i:-12
vs AS:i:-48
). Here is the original Bowtie2 output for the top strand reference only:
@PG ID:bowtie2 PN:bowtie2 VN:2.4.5 CL:"/bi/apps/bowtie2/2.4.5/bowtie2-align-s --wrapper basic-0 -q --score-min L,0,-0.2 --ignore-quals --norc -x reference/Bisulfite_Genome/CT_conversion/BS_CT test.fastq_C_to_T.fastq"
FFSP241:66:BPN80002-2834:1:1103:4490:2770_1:N:0:5 0 ref_CT_converted 187 40 295M * 0 0 TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTGTGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGGTGGGTGGATGATTTTAGGTTGGGAGTTTTAGATTAGTTTGATTAAT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F:: AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:239G23G31 YT:Z:UU
FFSP241:66:BPN80002-2834:1:1114:8200:1450_1:N:0:5 0 ref_CT_converted 187 0 295M * 0 0 TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGTTGGGTGGATGATTTAAGTTTGGGAGTTTTAAATTAATTTAATTAAT FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,: AS:i:-48 XN:i:0 XM:i:8 XO:i:0 XG:i:0 NM:i:8 MD:Z:205G33G8G14G2G12G4G3G6YT:Z:UU
Since the MAPQ score is a mix between uniqueness of the sequence, as well as the number of mismatches (and/or indels), as a function of the --score-min
function, this would be the reason. We personally do not filter for MAPQ scores, as the very strict scores employed by Bismark by default (--score_min L,0,-0.2
) aggravates this situation. As an example, if you relax the mapping criteria, the MAPQ score changes markedly:
--score_min L,0,-0.4
:
@PG ID:bowtie2 PN:bowtie2 VN:2.4.5 CL:"/bi/apps/bowtie2/2.4.5/bowtie2-align-s --wrapper basic-0 -q --score-min L,0,-0.4 --ignore-quals --norc -x reference/Bisulfite_Genome/CT_conversion/BS_CT test.fastq_C_to_T.fastq"
FFSP241:66:BPN80002-2834:1:1103:4490:2770_1:N:0:5 0 ref_CT_converted 187 42 295M * 0 0 TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTGTGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGGTGGGTGGATGATTTTAGGTTGGGAGTTTTAGATTAGTTTGATTAAT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F:: AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:239G23G31 YT:Z:UU
FFSP241:66:BPN80002-2834:1:1114:8200:1450_1:N:0:5 0 ref_CT_converted 187 23 295M * 0 0 TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGTTGGGTGGATGATTTAAGTTTGGGAGTTTTAAATTAATTTAATTAAT FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,: AS:i:-48 XN:i:0 XM:i:8 XO:i:0 XG:i:0 NM:i:8 MD:Z:205G33G8G14G2G12G4G3G6YT:Z:UU
--score_min L,0,-0.6
:
@PG ID:bowtie2 PN:bowtie2 VN:2.4.5 CL:"/bi/apps/bowtie2/2.4.5/bowtie2-align-s --wrapper basic-0 -q --score-min L,0,-0.6 --ignore-quals --norc -x reference/Bisulfite_Genome/CT_conversion/BS_CT test.fastq_C_to_T.fastq"
FFSP241:66:BPN80002-2834:1:1103:4490:2770_1:N:0:5 0 ref_CT_converted 187 42 295M * 0 0 TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTGTGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGGTGGGTGGATGATTTTAGGTTGGGAGTTTTAGATTAGTTTGATTAAT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FF:F:FF,FFFFFFFFF,FF:FFFFFFF:F:FFFFFFF,F,F,:FFF:F,F::FFFFFF:,FF,:FF,,F:F,:FFFF:F:,F,:::,:FF,,,::FFF:F:FFF::,FF:FFF::F:: AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:239G23G31 YT:Z:UU
FFSP241:66:BPN80002-2834:1:1114:8200:1450_1:N:0:5 0 ref_CT_converted 187 40 295M * 0 0 TAAGGTGGGTAATTTTTAAAGTTAGGTGGAAATGTTTGGTTTGAATTGTATATATGTTAGGGTGGAGAGTGAGAATGAATGTAAAGAGAGGTTAGTGTTGGGGTGGGGTTGTTGGGGTAAAGGGATAGGTTTGGGGGAGGTTTGGGTTGTTATGGTTAGAAAGTATGGAAAGAGTTTAAGTAAAATAGTTGTGTGTTGGTAGGGTATGGTGGTTTATGTTTGTAATTTTAGTATTTTGGAAGGTTGAGTTGGGTGGATGATTTAAGTTTGGGAGTTTTAAATTAATTTAATTAAT FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:F:FF:FFFF::FFF,FFFFF:FFFF,,FF:FFFFFFFFFF,FFFFFF:,::FFFF,,:FF,FF,::,:::FFFF:FF,F:::,F:FF,FFFF,,FFF,,F,F,:,,FF,F,,::,: AS:i:-48 XN:i:0 XM:i:8 XO:i:0 XG:i:0 NM:i:8 MD:Z:205G33G8G14G2G12G4G3G6YT:Z:UU
I hope this resolves it?
Hi Felix,
I have two questions:
Thanks, -Ar
1) The CIGAR string doesn't say that match is perfect, it says the match is linear, i.e. there are no insertions or deletions. You need to look at the MD:Z:
string to see mismatches, the alignment score (AS:i:0
would be a perfect match), and maybe the hamming distance (NM I believe) if you want more detailed informtion.
2) I am not sure if there is a generally accepted definition of "ideal". I personally tend to use Bismark with its stringent default scoring but without any subsequent MAPQ filtering (assuming that the mapping takes care of worst multi-mapping reads alrteady).
Thanks, @FelixKrueger. Appreciate it.
Hi @FelixKrueger,
I went through the MD:Z string, and I am not able to understand what does the following string means:
MD:Z:0C11C0C2C4C0C12C9C0C1C1C5C18C5C9C3C12C2C0C4C7C3C0C9C0C4C2C0C12C10C13C10C5G0C5C1C1C1C0C6C1C2C1C5G3C0C3G11C0C1G2G1C7C0C1G1C0C1G0C0C1G1C0C2C0
It is the 13th column of the aligned bam file. Thanks!
This is an auxiliary flag, you can read up on it in the SAM specification if you are interested (https://samtools.github.io/hts-specs/SAMtags.pdf).
More specifically:
MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
String encoding mismatched and deleted reference bases, used in conjunction with the CIGAR and
SEQ fields to reconstruct the bases of the reference sequence interval to which the alignment has been
mapped. This can enable variant calling without requiring access to the entire original reference.
The MD string consists of the following items, concatenated without additional delimiter characters:
• [0-9]+, indicating a run of reference bases that are identical to the corresponding SEQ bases;
• [A-Z], identifying a single reference base that differs from the SEQ base aligned at that position;
• \^[A-Z]+, identifying a run of reference bases that have been deleted in the alignment.
As shown in the complete regular expression above, numbers alternate with the other items. Thus if two
mismatches or deletions are adjacent without a run of identical bases between them, a ‘0’ (indicating
a 0-length run) must be used to separate them in the MD string.
Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations) are not
represented in the MD string. When reconstructing the reference sequence, inserted and soft-clipped
SEQ bases are omitted as determined by tracking ‘I’ and ‘S’ operations in the CIGAR string. (If the
CIGAR string contains ‘N’ operations, then the corresponding skipped parts of the reference sequence
cannot be reconstructed.)
For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are
10 matches followed by an A on the reference which is different from the aligned read base; the next 5
reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC;
the last 6 bases are matches
Hi Felix,
I hope you are doing well. I am using the bismark v0.22.3. I have a question regarding the MAPQ score for the perfectly mapped reads. The MPAQ for one read is 40, and the other is 0, even though both have a high Phred score and have an exact mapping cigar sequence of 295M. Can you please elaborate on why this is the reason? Thanks. Appreciate it.
Here are the fastq reads:
Here is the reference reads is:
>ref NNNNNNNNNNCTGGAGAGGAAGGCAACTGAGGGCTGCGGGAGGTGAGGAGGAGGCGTGGACGCCAGGAGTTGTTTTGCTCACACGCCCCGCGGCGTCCCGCCTAGGTGAGCTCCGGACCCTCCCGCCCCACCCCAGCGGGGCTGCTCGACTCCGTCGTGCCCCCTTGGGCTTCCACTGCAGGAATGCAAGGCGGGTAACCTTCAAAGCCAGGTGGAAACGTCCGGTTTGAACCGCACATATGCTAGGGTGGAGAGTGAGAACGAATGCAAAGAGAGGCTAGCGTTGGGGTGGGGCTGCCGGGGCAAAGGGACAGGCCCGGGGGAGGCCCGGGCCGCCACGGTTAGAAAGCATGGAAAGAGCTTAAGTAAAATAGCTGCGTGTCGGCAGGGCGCGGTGGCTCACGCCTGTAATCTCAGCACTTTGGGAGGCCGAGGCGGGCGGATGACCTGAGGTCGGGAGTTCCAGACCAGCCTGACCAACATGGAGNNNNNNNNNN
and here is the aligned sequence:
Best, -Ar