FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
394 stars 103 forks source link

Help with interpretation of M-bias plot #701

Closed xinyun-stacy-li closed 1 month ago

xinyun-stacy-li commented 2 months ago

Thanks for creating and actively maintaining this amazing tool! I'm a first-time user of Bismark. I'm running this pipeline starting from FastQC, Trim Galore, Bismark aligner, and Bismark methylation extractor. The libraries I'm working on is NEB EM-seq paired-end sequenced samples. I trimmed 10bp from 5' and 3' end for both read 1 and 2 according to your recommendation here (https://felixkrueger.github.io/Bismark/bismark/library_types/#em-seq-neb). The M-bias plots generated from one of the samples look very weird, especially for read 2 where the CpG methylation starts from roughly 40% and wiggles for the first half of the read length. I'm also confused about the reason why CpG total calls in read 1 starts to gradually drop even in the middle of the read length. I wonder if you could help me interpret the plots and suggest potential fixations. Thank you so much!

Screenshot 2024-09-26 at 10 58 21 Screenshot 2024-09-26 at 10 58 33
FelixKrueger commented 1 month ago

Tackling the question about decreasing levels of total calls first: the (rather moderate) drop in total calls for Read 1 are almost certainly the effect of adapter, and/or quality trimming of the reads. There will be a certain fraction of the sequenced reads are simply shorter than the read length, hence there is a drop with increased read length. This is also true for Read 2, but in addition the overlap detection and removal of redundant portions of the read means the observed decline is more pronounced.

Regarding the drop in methylation levels for Read 2: I'm afraid I don't have an explanation for that, it is certainly now what you would expect for EM-seq. Also, I can't think of a biological reason why the first half of all Read 2 should have a lower methylation level. My guess is it that something odd happened at the start of Read2 (on a technical level), maybe it is worth asking the genomics hub if there were any mishaps on the instrument?

Since lower methylation for Read 2 is encoded by A (instead of Gs), you could potentially add AAAAAAAAAA as a 'contamination` for FastQC (under configuration, adapters.txt or something similar). This might show whether there was an unusual content of PolyA at the start of Read 2. It might also be obvious in the sequence composition plot of FastQ.

I would probably recommend to use only R1 (single-end) alignments for this sample to ensure that you don't include any technical artefacts.

xinyun-stacy-li commented 1 month ago

Thank you so much for your comment. I did reach out to the sequencing center and they plan to re-sequence the samples. Hopefully that will resolve the issue. Thanks again!