Large discrepancy between genome coverage and CpG coverage

FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states

http://felixkrueger.github.io/Bismark/

GNU General Public License v3.0

386 stars 101 forks source link

Large discrepancy between genome coverage and CpG coverage #338

Closed nashchem closed 4 years ago

nashchem commented 4 years ago

Hi @FelixKrueger,

This might be a naive question. I am observing large discrepancies between genome coverage and CpG coverage.

For example, after deduplication step, total number of reads = 37207579 which roughly translates to a genome coverage of = 300X37207579/1210000000 = 9.22 (2X150 paired end read length, 1.21Gb Chicken genome size and 37207579 deduplicated reads count).

For the same library, from the Bismark coverage file (*deduplicated.bismark.cov.gz), CpG coverage = cumulative sum of meth and unmeth reads across all CpGs / total CpGs in the coverage file = 5.11.

Why do you think there is large difference in genome coverage of 9.22 vs 5.11 CpG coverage.

Thank you, Best, Naresh D J

FelixKrueger commented 4 years ago

Hi Naresh,

I am fairly certain that your theoretical genome coverage of 9.22 is - while being theoretically achievable - overly idealistic. A far more likely scenario is that your paired-end library has a variety of fragment lengths that are much shorter than 300bp.

In bisulfite libraries one often sees libraries with fragment lengths between 70 and 200bp, peaking at ~120bp. This would nicely explain why you are only getting half of the idealistic genome coverage.

This can be tested very quickly though. You could for example import your BAM file into SeqMonk (set the MAPQ filter to 0 to include all reads), and then click on Read Length histogram. This shouldn't take more than a minute, and will (hopefully) confirm my theory. If so, it proves the point: 2 x 150bp does not equal 300bp in a bisulfite setting, as fragments are typically much shorter than that (on top of that, calls from overlapping Read 2s are discarded during the methylation extraction).

nashchem commented 4 years ago

Hi @FelixKrueger,

Thank you for the quick response. yes, you are right that my average read length is around 170-180 bp from the distribution. I used the option to consider overlapping reads with "--include_overlap", even with this option calls from overlapping Read 2s are discarded during the methylation extraction? One last question, is it normal or any issues with library? read_length_distribution

FelixKrueger commented 4 years ago

The last question first: this is expected, and looks absolutely fine.

The option --include_overlaps should include more calls (just look at the splitting report and compare it to the default mode), but I would recommend you don't do that so you don't include PCR amplified calls and artificially inflate the coverage depth. Does that make sense?

nashchem commented 4 years ago

Thank you @FelixKrueger, it does make sense.