FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Spiky position in the M-Bias plot #673

Open neolithlee opened 2 weeks ago

neolithlee commented 2 weeks ago

[Images: bismark_mbias-CpG_R1, bismark_mbias-CpG_R2]

The data were first processed with FastQC and Trim Galore, and then with bismark, deduplicate_bismark and bismark_methylation_extractor as specified in the manual.

As can be seen in the M-bias plots (from MultiQC), the methylation level of Read 1 appears to be stable. However, the Read 2 plot shows some spikes. May I ask why this variation appears in the Read 2 plot?

FelixKrueger commented 2 weeks ago

Such spikes in M-bias plots (or sometimes also GC content plots etc.) are typically caused by individual sequences that are highly overrepresented and have a certain methylation state. You could try to identify the particular sequence via various means, the easiest probably being to look for isolated loci with a very high number of mapping reads. You could also try to see how many calls there are at this position (not sure you can do this in the MultiQC report, but you could look at the equivalent Bismark_report.html). In all likelihood such minor blips won't affect your downstream analysis overall; they are likely very localised effects (just my gut feeling at this point).
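As a rough illustration of checking the number of calls at a spiky position, here is a minimal Python sketch. It assumes you have already pulled the per-position rows for one context and read (position, count methylated, count unmethylated) out of the methylation extractor's M-bias output yourself. The function name `find_spikes`, the median-based rule and the 5-point threshold are all hypothetical choices for illustration, not part of Bismark.

```python
from statistics import median

def find_spikes(rows, threshold=5.0):
    """Flag read positions whose % methylation deviates from the median.

    rows: iterable of (position, count_methylated, count_unmethylated).
    threshold: absolute deviation in percentage points (arbitrary assumption).
    Returns a list of (position, percent_methylation, total_calls).
    """
    per_position = []
    for pos, meth, unmeth in rows:
        total = meth + unmeth
        if total == 0:
            continue  # no calls at this position, nothing to compare
        per_position.append((pos, 100.0 * meth / total, total))

    if not per_position:
        return []

    med = median(pct for _, pct, _ in per_position)
    return [
        (pos, round(pct, 2), total)
        for pos, pct, total in per_position
        if abs(pct - med) > threshold
    ]

# Made-up numbers purely to show the call; not real data.
example_rows = [(1, 700, 300), (2, 710, 290), (3, 900, 100), (4, 695, 305)]
print(find_spikes(example_rows))
```

Reporting the total number of calls next to each flagged position makes it easier to judge whether a spike coincides with an unusual call count there.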

neolithlee commented 2 weeks ago

Thanks for your reply.

As you said, some of the spikes seem to be related to the number of reads. In the case of the largest spike, the average Q-score appears to be lower than at other positions, so I will check whether there is an experimental problem.

[Image: M-bias]

FelixKrueger commented 2 weeks ago

These things sometimes present themselves as problematic in more than one of the FastQC modules. There could, for example, have been a technical issue with the flowcell (which you might see in the per-tile plot), such as an air bubble, or a higher rate of N calls at the position, or a very high number of one specific call (e.g. G when the signal from the dyes wasn't high enough). Or indeed there could be a very high prevalence of a certain base because of an overrepresentation of a certain (repetitive?) sequence, which will in turn down-adjust quality scores and the like. But given that it manifests itself in the M-bias plot, it has to come from a sequence that is mappable, which already narrows it down substantially. Happy sleuthing!
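One way to follow up on the "high prevalence of a certain base" idea is to tabulate base composition per read position. The sketch below is only an illustration: it assumes the Read 2 sequences have already been loaded as plain strings, and the function names and the 0.8 cut-off are hypothetical. Bisulfite-converted reads have a skewed base composition to begin with, so any cut-off would need adjusting to the library.

```python
from collections import Counter

def base_composition(reads):
    """Return one Counter of base calls per read position (0-based list)."""
    counts = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())  # first read long enough to reach position i
            counts[i][base] += 1
    return counts

def dominant_bases(reads, min_fraction=0.8):
    """Flag positions where a single call (including N) dominates.

    min_fraction is an arbitrary assumption, not a recommended value.
    Returns a list of (1-based position, base, fraction).
    """
    flagged = []
    for pos, counter in enumerate(base_composition(reads), start=1):
        base, n = counter.most_common(1)[0]
        fraction = n / sum(counter.values())
        if fraction >= min_fraction:
            flagged.append((pos, base, fraction))
    return flagged

# Toy sequences purely to show the call; not real reads.
toy_reads = ["ACGT", "AGGT", "ATGA", "ACGN"]
print(dominant_bases(toy_reads))
```

A position flagged here that lines up with the spike in the M-bias plot would point towards an overrepresented sequence or a base-calling problem at that cycle.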