Closed meydaria closed 3 years ago
Hi @meydaria,
What this df <- df[,list(Methyl_freq = mean(Methyl_freq), Cov = sum(Cov)), list(Chr,Pos)]
does is to combine the methylation frequencies from both strands for the same CpG site by adding up the coverage and averaging the methylation frequencies, for instance,
Chr Pos Strand Coverage Methylation
1 1503 + 10 1.0
1 1504 - 23 0.8
After combining the site that has methylation evidence from both strand:
Chr Pos Coverage Methylation
1 1503 33 0.9
Here we provide the pipelines for the users to generate aggregated modified base outputs in an augmented BED format similar to the bedMethyl format. One of these output stores the per-site prediction data:
The other one also stores the per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand, as I mentioned above.
The methylation frequency per site (i.e. at genome level) is calculated by the number of mapped reads predicted as methylated over the number of total mapped reads. We don't consider the weighted or unweighted mean to calculate the methylation frequency.
Hope this helps.
Zaka
Hi @zakayuen,
thanks for your reply! That doesn't quite answer my question, perhaps I phrased it too imprecisely.
I meant the step when you combine the methylation information for the two strands. As in your example, when you have at one position a methylation frequency of 1.0 on the plus strand, covered by 10 reads, and a methylation frequency of 0.8 on the minus strand, covered by 23 reads. When you combine the information for these two strands, and calculate the average as you said, you are now calculating the unweighted average for combining the methylation frequency values for both strands. So you calculate the new methylation frequency with (1.0 + 0.8) /2 = 0.9
, right? But instead you could also calculate the weighted average, to take into account that a different amount of reads supports the different methylation frequencies on the strands. For this example it would be (1.0*10+0.8*23)/33 = 0.86
. So a similar, but slightly different result.
I was just wondering, whether there is a reason, why you are not using this weighted average rather than the unweighted.
Best, Daria
Hi Daria,
You could definitely do that, it's a good idea. There are two situations you need to consider though
You can make the modification you suggest and test it on the controls. If you get an improvement, then please let us know and send us a Pull Request to incorporate your suggestion!
Thanks, Cam
Hi,
I have a question on how you calculate the mean after accumulating the CpG sites, for example in the file run_megalodon.R:
df <- df[,list(Methyl_freq = mean(Methyl_freq), Cov = sum(Cov)), list(Chr,Pos)]
Can you maybe explain why you use just the mean without weighting it based on the amount of reads that support the different methylation frequencies? I think I would rather use the weighted mean (giving more weight to a methylation frequency which is supported by a higher coverage)? Like in the following example:df <- df[,list(Methyl_freq = sum(Methyl_freq*Cov)/sum(Cov), Cov = sum(Cov)), list(Chr,Pos)]
Or am I missing some reason for which it is better to use the unweighted mean?