FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
382 stars 101 forks source link

Bismark methylation extractor's bedGraph output returning ~50 million CpGs for WGBS? #666

Closed JanmajaySingh closed 5 months ago

JanmajaySingh commented 5 months ago

Hi Felix,

I preprocessed some paired-end sequenced WGBS data following standard guidelines mentioned here.

My pipeline looks like this:

trim_galore --cores <CORES> --gzip --paired <read1> <read2>

bismark --genome "path/to/hg38" -q --parallel <CORES> --gzip -1 <sample_id>_R1_val_1.fq.gz -2 <sample_id>_R2_val_2.fq.gz

deduplicate_bismark --bam -p <sample_id>_R1_val_1_bismark_bt2_pe.bam

bismark_methylation_extractor --paired-end --no_overlap --ignore_r2 2 --gzip --multicore <CORES> --bedGraph --zero_based <sample_id>_R1_val_1_bismark_bt2_pe.deduplicated.bam

I'm using

all installed via bioconda.

In my understanding the --bedGraph option should consider only CpG methylation context by default.

But the outputted bedGraph and coverage files have close to 50 million CpGs (for chr 1-22) whereas I expected the number to be <27.8 million (calculated from hg38 reference).

Is Bismark returning methylation calls for other contexts as well? From some checking against hg38, it didn't look like this was the case but any help would be greatly appreciated!

FelixKrueger commented 5 months ago

Bismark reports the methylation as single cytosines, whereas the figure 27.8M represents CpG di-nucleotides, or dyads.

You could further process the coverage output to merge together top and bottom strand Cs using the option --merged_CpG for coverage2cytosine (see https://felixkrueger.github.io/Bismark/bismark/methylation_extraction/#optional-genome-wide-cytosine-report-output)

JanmajaySingh commented 5 months ago

Just to be clear, do you mean the outputted bedGraph contains methylation info for cytosines at all methylation contexts? If so, I am confused by the line "By default, only cytosines in CpG context are sorted." here.

Do the outputted bedGraph and coverage files have cytosines merged from both strands as well?

FelixKrueger commented 5 months ago

Both bedGraph and coverage files contain CpG context only, unless you selected --CX. Both files have single base resolution, so are not merged. If you get a call at fictitious positions 1 and 2, and 5 and 6, then these will be Cs in CpG context on the top (1 and 5) or bottom strand (2 and 6), respectively.

Both files are sorted from low to high position, per chromosome. Does that make it clear?

JanmajaySingh commented 5 months ago

that is very clear, thank you!

JanmajaySingh commented 5 months ago

Hey Felix, I had a follow up question. Is it standard practice to merge CpG info from both strands?

FelixKrueger commented 5 months ago

I am not sure if one can say common practice, it really depends on what you want to do and how you approach that question. We personally never did it, but for certain types of statistical methods it arguably reduces the multiple testing burden if you reduce the number of observations by half...

JanmajaySingh commented 5 months ago

Yeah that makes sense! While processing different WGBS datasets, it has been hard to guess if such merging was performed in the original study. Some of these studies (like GTEx) share their preprocessed bedGraphs with info from only one strand. But that data also clearly aggregated methylation from multiple samples of the same tissue. So it hasn't been possible to guess if across-strand merging was also performed.

In any case, thank you for your quick responses and help! Your (and Babraham's) varied resources online have been super helpful in my bioinformatics journey!

FelixKrueger commented 5 months ago

I'm glad it helped!

Maybe the easiest 'giveaway' whether strand merging was performed or not is to look at the size of a CpG. Single-base resolution will look like this:

start    end
  0       1     (bedGraph, 0-based half-open)
  1       1     (coverage file)

whereas a CpG dyad looks like this:

start    end
  0       2     (bedGraph, 0-based half-open)
  1       2     (coverage file)
JanmajaySingh commented 5 months ago

Yes, I got that after making the merged files this morning! Thanks!