FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
385 stars 101 forks source link

Clarification about uniquely mapping and ambiguous mapping report #336

Closed varunorama closed 4 years ago

varunorama commented 4 years ago

Hello Felix,

I wanted clarification regarding the mapping efficiency report that is produced after Bismark. For the sake of transparency, here is the full report for one of my samples. They are two lanes of single-end RRBS sequencing per sample; both lanes for the same sample were considered when running bismark.

Bismark report for: /home/vbdw222/scratch/RRBS/fastq_trimmed/Day0_S85_L003_R1_001_trimmed.fq (version: v0.22.1) Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed) Bismark was run with Bowtie 2 against the bisulfite genome of /scratch/vbdw222/chromosomal_assembly/ with the specified options: -q -N 1 --score-min L,0,-0.4 -p 4 --reorder --ignore-quals

Final Alignment report

Sequences analysed in total: 15716632 Number of alignments with a unique best hit from the different alignments: 7879863 Mapping efficiency: 50.1% Sequences with no alignments under any condition: 2790057 Sequences did not map uniquely: 5046712 Sequences which were discarded because genomic sequence could not be extracted: 25

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT: 3966715 ((converted) top strand) CT/GA: 3913123 ((converted) bottom strand) GA/CT: 0 (complementary to (converted) top strand) GA/GA: 0 (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 158286083

Total methylated C's in CpG context: 25205902 Total methylated C's in CHG context: 501854 Total methylated C's in CHH context: 880084 Total methylated C's in Unknown context: 16317

Total unmethylated C's in CpG context: 3100528 Total unmethylated C's in CHG context: 33058205 Total unmethylated C's in CHH context: 95539510 Total unmethylated C's in Unknown context: 178283

C methylated in CpG context: 89.0% C methylated in CHG context: 1.5% C methylated in CHH context: 0.9% C methylated in Unknown context (CN or CHN): 8.4%

Bismark completed in 0d 3h 28m 2s

My questions are:

1) When producing the mapped file, are those sequencing that are not mapped uniquely still considered and mapped at multiple locations in bam file produced, or are they not considered and removed? I know that you can use the -ambig_bam to produce a file showing the ambigously mapped sequenced, but did not know if the ambiguous reads are removed inherently since it's hard to tell where they should actually map.

2) The mapping efficiency reported (50%) seems to be the reads that are uniquely mapped. However looking at other mapping efficiency reports, some studies report the mapping efficiency as the sum of the uniquely mapped % + multiple/ambiguously mapped % (which in my case would be closer to 82%). Do you have any thoughts on which may be more appropriate to report?

FelixKrueger commented 4 years ago

Hi @varunorama

To 1: Reads that align ambiguously are removed and will not appear in the final BAM file. Indeed the rationale for removing those reads is that we do not want to infer methylation levels for reads where we are not sure where they came from.

To 2: I suppose this is really a matter of taste. It is true that the overall mapping rate in your case is ~80%, which also means that your sample is probably quite pure and doesn't contain any notable levels of contamination. In terms of actually usable data the 50% would be more accurate, as the other 30% align ambiguously and are hence not getting reported (see 1:).

Does this help?

varunorama commented 4 years ago

Hello again Felix,

This absolutely clears up my questions, thank you!

I do have one last question that arises from this information. Our group is wondering how to extract coverage information from the multimapped regions (our genome is extremely repetitive so many of these sequences may give us biological insight in our model's genome). We think that the best way to go about it would be to run the bismark_methylation_extractor program on the ambiguously_mapped.bam file that is produced to estimate potential coverage of these mapped regions. Would you recommend this method or do you have a more recommended method in understanding coverage of the multimapped regions?

FelixKrueger commented 4 years ago

Hi @varunorama

The ambiguous BAM file will only produce a single mapping position for each ambiguously mapping read, but this will be directly in Bowtie 2 BAM format (or HISAT2 if you used that). This means that it does not contain the Bismark specific methylation call information (and as I explained further up in this thread we don't want to be doing this because it would then be used by everyone (and we don't trust it...)).

You could still find out where in the genome the reads came from etc, just without doing the methylation part.... (which is what I would understand is coverage of multimapped regions).

varunorama commented 4 years ago

Got it, thank you for clarifying Felix. Yes, it makes sense. And apologies, I was conflating coverage with location of the genome. Thanks again!