WGLab / DeepMod2

DNA 5mC methylation detection from Dorado or Guppy basecalled Oxford Nanopore reads
MIT License

methylation calling in non-CpG context (CHG and CHH) #6

Open yusmiatiliau opened 1 year ago

yusmiatiliau commented 1 year ago

Hello there, I am re-posting an issue I originally opened in DeepMod about future plans to detect other methylation motifs in plants. I hope this can be one of the new features in DeepMod2. Thanks a lot.

kaichop commented 1 year ago

This is in principle very possible, as long as there is good training data with a paired gold standard for methylation within the CHG and CHH motifs. If you are aware of such plant-specific datasets, we can train a model to detect methylation in non-CpG contexts. Thank you.
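For reference, the three contexts differ only in the two bases that follow the cytosine, where H stands for A, C, or T. Below is a minimal sketch of that classification for the forward strand only. The function name `cytosine_context` is hypothetical and is not part of DeepMod2.

```python
from typing import Optional

H_BASES = set("ACT")  # IUPAC code H: any base except G


def cytosine_context(seq: str, i: int) -> Optional[str]:
    """Classify the forward-strand cytosine at index i as 'CG', 'CHG', or 'CHH'.

    Returns None if seq[i] is not C, or if the downstream bases needed to
    decide are missing or ambiguous (for example N).
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    if i < 0 or i >= len(seq) or seq[i] != "C":
        return None
    if i + 1 >= len(seq):
        return None
    first = seq[i + 1]
    if first == "G":
        return "CG"
    if first not in H_BASES or i + 2 >= len(seq):
        return None
    second = seq[i + 2]
    if second == "G":
        return "CHG"
    if second in H_BASES:
        return "CHH"
    return None
```

Cytosines on the reverse strand can be handled by applying the same function to the reverse complement of the sequence.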

yusmiatiliau commented 1 year ago

Hi @kaichop,

Thank you very much for your response. I need to talk to my supervisor, but we should have data from both nanopore sequencing and whole-genome bisulfite sequencing. Would those be enough? Also, our data are all quite recent, so they come from R10 flowcells. Would that be compatible with DeepMod2? Thanks again.

kaichop commented 1 year ago

DeepMod2 can be trained on R10 flowcell data, but the resulting model will differ from the R9.4 models. We have tested it on HG002 (on R10) and it works well.

umahsn commented 1 year ago

Just to add some extra information: yes, whole-genome bisulfite sequencing and Nanopore sequencing should be sufficient. We have uploaded two CpG models trained on high-coverage, Guppy-basecalled R10.4 and R9.4.1 reads from the ONT open datasets release, using one BS-seq replicate, and we have achieved very high performance.

yusmiatiliau commented 1 year ago

Thanks both. I'll get back to you guys regarding the datasets for training.

yusmiatiliau commented 1 year ago

Hi @kaichop and @umahsn,

Sorry for the delay. We apparently don't have any matching ONT and WGBS datasets from the same sample yet, but we are looking forward to generating them. For the model training, are there any specifications for the dataset (e.g. coverage) that you would need?

Thanks again, Cen

umahsn commented 1 year ago

Hi,

For CpG, we were able to achieve very high performance (~94% F1) with a ~30X NA12878 native ONT dataset, using the consensus of two WGBS replicates as ground truth, and we achieved slightly better performance (~95% F1) with a ~90X HG002 native ONT dataset, using a single WGBS sample.

On the other hand, we also achieved very good performance (~90-93% F1) when we trained on low-coverage, synthetically methylated and unmethylated controls of HG001 from Simpson. I believe both controls have less than 5X coverage.

I think the important thing in both cases is having a sufficient total number of reads whose methylation status is known with a high degree of confidence. This can come from 1) high coverage at a relatively small number of sites that have high-confidence labels, or 2) low coverage at many sites that have extremely high-confidence labels. In case 1), for native ONT datasets, we trained models using ground-truth labels from WGBS under very strict criteria: a minimum WGBS coverage of 5, and all replicates had to show 100% methylation for a site to be considered methylated, or 0% methylation for it to be considered unmethylated. Even with these strict criteria, there were ~850k-1M CpG sites for model training, and paired with at least 30X ONT coverage, that's a lot of training data. In case 2), using synthetic positive and negative controls, even though coverage was low, we used all ~50 million CpG sites for training, since we had great confidence that each site was methylated in the positive control and unmethylated in the negative control.
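The strict WGBS labelling rule from case 1) could be sketched roughly as follows. This is only an illustration, not DeepMod2's actual code: the function name `assign_label` and the `(methylated_reads, total_reads)` input layout are hypothetical, and applying the minimum coverage to each replicate separately is an assumption.

```python
from typing import Optional, Sequence, Tuple

# One WGBS replicate at one site: (methylated_reads, total_reads).
# This layout is a hypothetical choice for the example.
Replicate = Tuple[int, int]


def assign_label(replicates: Sequence[Replicate],
                 min_coverage: int = 5) -> Optional[int]:
    """Return 1 (methylated), 0 (unmethylated), or None (site excluded).

    Assumption: the minimum coverage must be met by every replicate.
    """
    if not replicates:
        return None
    # Drop the site if any replicate is below the coverage threshold.
    # max(..., 1) also guards against zero-coverage replicates.
    if any(total < max(min_coverage, 1) for _, total in replicates):
        return None
    # Methylated only if every replicate shows 100% methylation.
    if all(meth == total for meth, total in replicates):
        return 1
    # Unmethylated only if every replicate shows 0% methylation.
    if all(meth == 0 for meth, _ in replicates):
        return 0
    # Partially methylated or discordant replicates are excluded.
    return None
```

Sites that return `None` are simply left out of training, which is how the strict filter trades the number of sites for label confidence.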

In short, high-coverage ONT data will only help with training if you can assign methylated or unmethylated labels to the reads with high confidence. That is why it is very important to place more emphasis on generating proper ground-truth labels for the motifs you are interested in, whether via WGBS or synthetically. Please let me know if you have more questions.

yusmiatiliau commented 1 year ago

Thanks @umahsn, and apologies for the delay in responding. I will get in touch again once we have the WGBS data in hand.