comprna / METEORE

Automatic DNA methylation detection from nanopore tools and their consensus model
MIT License
72 stars 18 forks source link

weighted vs. unweighted average #14

Closed meydaria closed 3 years ago

meydaria commented 3 years ago

Hi,

I have a question on how you calculate the mean after accumulating the CpG sites, for example in the file run_megalodon.R: df <- df[,list(Methyl_freq = mean(Methyl_freq), Cov = sum(Cov)), list(Chr,Pos)] Can you maybe explain why you use just the mean without weighting it based on the amount of reads that support the different methylation frequencies? I think I would rather use the weighted mean (giving more weight to a methylation frequency which is supported by a higher coverage)? Like in the following example: df <- df[,list(Methyl_freq = sum(Methyl_freq*Cov)/sum(Cov), Cov = sum(Cov)), list(Chr,Pos)] Or am I missing some reason for which it is better to use the unweighted mean?

zakayuen commented 3 years ago

Hi @meydaria,

What this df <- df[,list(Methyl_freq = mean(Methyl_freq), Cov = sum(Cov)), list(Chr,Pos)] does is to combine the methylation frequencies from both strands for the same CpG site by adding up the coverage and averaging the methylation frequencies, for instance,

Chr  Pos   Strand  Coverage  Methylation
1   1503  +   10   1.0
1   1504  -   23   0.8

After combining the site that has methylation evidence from both strand:

Chr  Pos  Coverage  Methylation
1   1503   33   0.9

Here we provide the pipelines for the users to generate aggregated modified base outputs in an augmented BED format similar to the bedMethyl format. One of these output stores the per-site prediction data:

The other one also stores the per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand, as I mentioned above.

The methylation frequency per site (i.e. at genome level) is calculated by the number of mapped reads predicted as methylated over the number of total mapped reads. We don't consider the weighted or unweighted mean to calculate the methylation frequency.

Hope this helps.

Zaka

meydaria commented 3 years ago

Hi @zakayuen,

thanks for your reply! That doesn't quite answer my question, perhaps I phrased it too imprecisely.

I meant the step when you combine the methylation information for the two strands. As in your example, when you have at one position a methylation frequency of 1.0 on the plus strand, covered by 10 reads, and a methylation frequency of 0.8 on the minus strand, covered by 23 reads. When you combine the information for these two strands, and calculate the average as you said, you are now calculating the unweighted average for combining the methylation frequency values for both strands. So you calculate the new methylation frequency with (1.0 + 0.8) /2 = 0.9, right? But instead you could also calculate the weighted average, to take into account that a different amount of reads supports the different methylation frequencies on the strands. For this example it would be (1.0*10+0.8*23)/33 = 0.86. So a similar, but slightly different result.

I was just wondering, whether there is a reason, why you are not using this weighted average rather than the unweighted.

Best, Daria

cameron-jack commented 3 years ago

Hi Daria,

You could definitely do that, it's a good idea. There are two situations you need to consider though

  1. most sites will have both strands methylated or unmethylated, so minor variations score aren't likely to be strong influencing factors. This may matter more for categorical predictive mechanisms such as Random Forest, where you have discrete cutoff levels.
  2. You have methylation on just one strand. Averaging (weighted or not) makes no sense and will only ruin your actual signal. Combining strand information is simply the wrong approach in this case.

You can make the modification you suggest and test it on the controls. If you get an improvement, then please let us know and send us a Pull Request to incorporate your suggestion!

Thanks, Cam