bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

How to calculate methylation frequencies from predict probability #46

Closed PanZiwei closed 4 years ago

PanZiwei commented 4 years ago

Hi, I have a question about the correlation analysis between the CpG methylation frequencies calculated by DeepSignal/nanopolish and those from bisulfite sequencing.

When you calculate the Pearson coefficient, how do you calculate the methylation frequencies from the DeepSignal model at genome level? I think your model provides the predicted methylated probability P+' and unmethylated probability P-' for each tested site in the genome, but not the methylation frequencies. So do you assume that the methylated probability P+' is equal to the methylation frequency (5mC%) at each target site?

How about Nanopolish? Since Nanopolish uses a log-likelihood ratio to make a methylation call for each site, how do you convert the ratio into methylation frequencies so that they are comparable with BS-seq?

Thank you so much for your help!

PengNi commented 4 years ago

Hi @PanZiwei ,

To calculate methylation frequencies of CpGs at genome level, the methylation status of the CpG in each read is first labeled as 0 or 1. Then we just count the number of 1s and divide it by the total number of reads to get the frequency.
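The counting step can be sketched as follows. This is only a minimal illustration, not the code in call_modification_frequency.py: the function name, the `(site, label)` input layout, and the toy calls are assumptions made for the example.

```python
from collections import defaultdict


def methylation_frequencies(calls):
    """Return {site: fraction of reads labeled 1}.

    `calls` is an iterable of (site, label) pairs, one per read covering
    the site, where label is 0 (unmethylated) or 1 (methylated).
    The input layout is hypothetical, chosen for illustration.
    """
    methylated = defaultdict(int)
    total = defaultdict(int)
    for site, label in calls:
        total[site] += 1
        methylated[site] += label
    return {site: methylated[site] / total[site] for site in total}


# Toy per-read calls (made-up values): site = (chromosome, position).
calls = [
    (("chr1", 100), 1),
    (("chr1", 100), 0),
    (("chr1", 100), 1),
    (("chr1", 100), 1),
    (("chr1", 250), 0),
    (("chr1", 250), 0),
]
freqs = methylation_frequencies(calls)
print(freqs[("chr1", 100)])  # 3 of 4 reads labeled 1 -> 0.75
```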

To label the CpG in each read, deepsignal compares P+' and P-'. Nanopolish checks whether the log-likelihood ratio is greater or less than 0, or greater than +2.5 / less than -2.5 as in their paper.
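A rough sketch of the two labeling rules, under some assumptions: the function names are hypothetical, ties (P+' equal to P-', or a ratio exactly at the threshold) are resolved arbitrarily here, and reads whose ratio falls between the two thresholds are treated as ambiguous and returned as `None`. Neither tool's actual code is reproduced.

```python
def label_from_probabilities(p_methylated, p_unmethylated):
    """DeepSignal-style rule: 1 if P+' is larger than P-', else 0.

    Tie handling (equal probabilities -> 0) is an assumption.
    """
    return 1 if p_methylated > p_unmethylated else 0


def label_from_llr(llr, threshold=2.5):
    """Nanopolish-style rule on a log-likelihood ratio.

    Returns 1 if llr >= threshold, 0 if llr <= -threshold, and None when
    the ratio lies in between (treated as ambiguous in this sketch).
    Use threshold=0 for the simple "greater or less than 0" rule.
    """
    if llr >= threshold:
        return 1
    if llr <= -threshold:
        return 0
    return None


print(label_from_probabilities(0.8, 0.2))  # 1
print(label_from_llr(3.1))                 # 1
print(label_from_llr(1.0))                 # None (between -2.5 and 2.5)
```

Reads labeled `None` would simply be left out of both the numerator and the denominator when counting.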

We have uploaded a script (call_modification_frequency.py). I think nanopolish also has its own script to calculate frequencies.

Best, Peng

PanZiwei commented 4 years ago

Hi Peng, Thank you so much for your reply. I just want to make sure we are on the same page. Any point (x, y) in the correlation heatmap (Figure 2) stands for a specific CpG site, where x = the methylation frequency at that site calculated from BS-seq, and y = the number of reads labelled as 1 / the total number of reads mapped to that site?

Thank you!

PengNi commented 4 years ago

Yes, that's right for calculating correlation. Also, in our test, only the sites with at least 5 mapped reads are used for the comparison.
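For illustration, the comparison could look roughly like the sketch below. It assumes the 5-read cutoff is applied to the Nanopore read count per site, and the dictionary layouts, function names, and toy numbers are all made up for the example.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def paired_frequencies(nanopore, bisulfite, min_reads=5):
    """Collect (x, y) pairs for sites present in both datasets.

    nanopore:  {site: (reads_labeled_1, total_reads)}  (hypothetical layout)
    bisulfite: {site: methylation_frequency}           (hypothetical layout)
    Sites with fewer than `min_reads` Nanopore reads are skipped.
    """
    xs, ys = [], []
    for site, (methylated, total) in nanopore.items():
        if total >= min_reads and site in bisulfite:
            xs.append(bisulfite[site])       # x: BS-seq frequency
            ys.append(methylated / total)    # y: reads labeled 1 / total reads
    return xs, ys


# Toy data (made-up values); site "d" has only 3 reads and is dropped.
nanopore = {"a": (8, 10), "b": (1, 5), "c": (3, 6), "d": (2, 3)}
bisulfite = {"a": 0.8, "b": 0.2, "c": 0.5, "d": 0.9}

xs, ys = paired_frequencies(nanopore, bisulfite)
r = pearson(xs, ys)
```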

Best, Peng