FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

Bismark's terminal output alignment stats #289

Closed rauldiul closed 4 years ago

rauldiul commented 4 years ago

Dear Felix,

I have a question regarding the output that Bismark returns in the terminal, which gives details on the numbers of concordant/discordant pairs, and which does not appear in the outputted report file.

I'm running Bismark for paired-end files with the following code:

bismark --score_min L,0,-0.4 /home/rtejedor/Documents/Learning/oxWGBS/oxWGBS_genome -1 JC-BS_R1_first_250k.fq -2 JC-BS_R2_first_250k.fq

The output displays the default Bowtie 2 options selected, including --no-discordant:

Summary of all aligner options: -q --score-min L,0,-0.4 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500

In the terminal output, Bismark shows the stats for Alignment rate and concordant/discordant mates (this is the part that is not included in the final report):

Reading in the sequence files JC-BS_R1_first_250k.fq and JC-BS_R2_first_250k.fq
Chromosomal sequence could not be extracted for A00251:137:HT3F2DSXX:1:1101:32117:2660_1:N:0:CTAGCT MT  16149
Chromosomal sequence could not be extracted for A00251:137:HT3F2DSXX:1:1101:29785:11772_1:N:0:CTAGCT    MT  16166

250000 reads; of these:
  250000 (100.00%) were paired; of these:
    135565 (54.23%) aligned concordantly 0 times
    76841 (30.74%) aligned concordantly exactly 1 time
    37594 (15.04%) aligned concordantly >1 times
45.77% overall alignment rate
250000 reads; of these:
  250000 (100.00%) were paired; of these:
    135540 (54.22%) aligned concordantly 0 times
    77293 (30.92%) aligned concordantly exactly 1 time
    37167 (14.87%) aligned concordantly >1 times
45.78% overall alignment rate
Processed 250000 sequences in total

And it finally shows the Mapping efficiency:

Final Alignment report
Sequence pairs analysed in total:   250000
Number of paired-end alignments with a unique best hit: 172537
Mapping efficiency: 69.0%

Sequence pairs with no alignments under any condition:  58479
Sequence pairs did not map uniquely:    18984
Sequence pairs which were discarded because genomic sequence could not be extracted:    2

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   87464   ((converted) top strand)
GA/CT/CT:   0   (complementary to (converted) top strand)
GA/CT/GA:   0   (complementary to (converted) bottom strand)
CT/GA/GA:   85071   ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

My understanding is that the --no-discordant argument will make Bowtie 2 "not look" for alignment for the reads belonging to discordant mates. Thus, looking at the output, what I do not understand is,

I'm sure I'm missing something obvious here, and maybe this is a Bowtie 2 question,

but thanks for your help

FelixKrueger commented 4 years ago

Hi @rauldiul

The answer to your questions is actually fairly simple: the seemingly duplicated alignment stats that appear on screen are the STDERR output of the two alignment threads that are being run by Bismark in parallel. Let's assume that

250000 reads; of these:
  250000 (100.00%) were paired; of these:
    135565 (54.23%) aligned concordantly 0 times
    76841 (30.74%) aligned concordantly exactly 1 time
    37594 (15.04%) aligned concordantly >1 times
45.77% overall alignment rate

is the OT (original top) strand, and

250000 reads; of these:
  250000 (100.00%) were paired; of these:
    135540 (54.22%) aligned concordantly 0 times
    77293 (30.92%) aligned concordantly exactly 1 time
    37167 (14.87%) aligned concordantly >1 times
45.78% overall alignment rate
Processed 250000 sequences in total

is the OB (original bottom) strand (it could be the other way round but that doesn't matter here). As you can see, both of these alignments produce ~31% uniquely mapping alignments, and ~15% potentially multi-mapping alignments. Discordant reads do not even make it into this report, as they are discarded straight away.

What Bismark then does internally is to figure out whether sequences can be assigned uniquely to either the top or bottom strands. Only then are reads reported in the Bismark output and count toward the Bismark report. So 62% (2* 31%) of unique top or bottom strand alignments, and a fraction of the multimapping reads that have a best alignment to either the top or bottom strand, together result in an overall Bismark mapping efficiency of ~69%. Does that make sense?

69% mapping is not bad, but whether or not this is already the maximum you can achieve depends on several factors. Among them,

We have tried to list a few more details here: https://github.com/FelixKrueger/Bismark/blob/master/Docs/FAQ.md#issue-2-low-mapping-effiency-of-paired-end-bisulfite-seq-sample

Let me know you still have any questions.

rauldiul commented 4 years ago

Dear Felix,

Thank you for the quick response. Indeed it was a quite obvious issue... and you have cleared my doubts.

Yes, I'm OK with the 69 % efficiency, these are 150bp PE reads, R1, R2 similar and good quality, and trimmed, but they are mouse genome. So I'll play with --score-min and such, but the efficiency seems to be normal.

thanks for your help

FelixKrueger commented 4 years ago

Excellent, glad I could help.