chrisamiller / readDepth

R package for inferring copy number from read depth
Other
30 stars 11 forks source link

Methylation Correction: Non-Covered Sites #16

Closed kenji-yt closed 3 weeks ago

kenji-yt commented 1 month ago

Hi,

No issue with the code but a question regarding the methylation correction.

I would like to re-implement your approach with the input being a Bismark methylation call file giving the following information: % of methylation for each covered cytosines.

I don't know how these differ from the input methylation files in your method but assuming the information is the same: My understanding is that GC content in a bin is corrected such that GC content = number of GC sites / 2 * methylation % at each GC site.

I would like to know how you deal with GC sites for which there is 0 coverage. Are they counted as fully methylated or fully un-methylated.

Thank you in advance.

chrisamiller commented 1 month ago

I would like to re-implement your approach with the input being a Bismark methylation call file giving the following information: % of methylation for each covered cytosines.

This is the function in question. https://github.com/chrisamiller/readDepth/blob/master/R/rd.gccorrect.R#L49

I agree that it is under-documented (and I probably wouldn't write it this way if I was creating the package today!) That said, here's what I believe is needed:

Inside the annotations directory, create a folder called "methMap" and split your methylation tracks by chromosome and gzip them, so that tthey can be read in by the following line:

filename = paste("annotations/methMap/map.",chr,".gz",sep="")

It expects that a) there is a header on that file (it strips the first line), and that column 1 contains the position (i.e. 123456), column 4 contains the count of methylated bases, and column 6 contains the depth of coverage at that base. Later, it calculates the methylation percentage as column4/column6.

I would like to know how you deal with GC sites for which there is 0 coverage. Are they counted as fully methylated or fully un-methylated.

The default assumption is that all Cs are converted to U by the bisulfite treatment (and read out as T), so GC content is halved:

gcBases <- readBin$gc/2 * effectiveLength

The methylated/protected bases are then added back in at the end to adjust that gc content for the bin.

  for(i in 1:length(matches[,1])){
    gcBases[matches[i,1]] = gcBases[matches[i,1]] + methProp[matches[i,2]]
  }
  readBin$gc = gcBases/effectiveLength

Hope this helps!

kenji-yt commented 3 weeks ago

Very helpful, thanks!

I tried plotting read count vs GC in methylated data with and without the GC correction and the relationship seems weaker with the correction:

Screenshot 2024-06-13 at 20 45 54

I used read counts per 1kb bins generated with: bedtools coverage -a < bin range > -b < sorted bam file > I then used bismark methylation extraction bed file and corrected GC in each bin with that methylation information as per your method.

The data came from methylated and unmethylated Arabidopsis plants (I tried different species with the same result). Unmethylated data appears to show the expected uni-modal distribution found by Speed and Benjamini (2012) but methylated data shows a negative relationship and correction does not recover the unimodal relationship in my case.

Perhaps there is something wrong with how GC, methylation or coverage is counted in my case. Did you investigate how this transformation affects the relationship between GC and coverage? If not, perhaps you are aware of studies on this topic?

Reference:

chrisamiller commented 3 weeks ago

I agree that looks wonky, and I'm not sure that I'd trust it. When we wrote this, there was very little WGBS data to test on, so it wouldn't surprise me if the approach needs to be somewhat altered. Would be interested to hear if you find a method that corrects more cleanly!

kenji-yt commented 3 weeks ago

I see. Thanks for the feedback!