PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data

the data set for model training #87

Open SarahChen0401 opened 3 years ago

SarahChen0401 commented 3 years ago

I would like to know which training and testing datasets were used for the ipdSummary model that detects DNA methylation.

In the paper "Direct detection of DNA methylation during single-molecule, real-time sequencing", the authors say: "we designed several synthetic DNA templates that were identical except for their methylation status at specific sites. The control template contained no methylation, and other templates contained several mA, mC or hmC bases." In the supplementary material, they say: "The sequences of the four synthetic DNA templates (control, mA, mC, and hmC) described in Figs. 1-4 are 199bp." Is 199 bp enough for model training? The human genome has about 3 billion bases.

I could not find the model training dataset. Is there a paper that describes the training dataset used to build the ipdSummary model?

rhallPB commented 3 years ago

The current model is trained on DNA from 5 WGA bacterial samples and 7 native bacteria with well-characterized RM systems and therefore known modifications. Given the limitations of the current training sets, only the identification of 6mA and 4mC is supported. The training datasets are not public. What specifically do you need from the training data?

SarahChen0401 commented 3 years ago

Does PacBio plan to train a model to call m5C in the human genome? For example, we could generate WGA human samples and methylated human samples using an enzyme such as M.SssI, which converts every C in a CG context to mC. Do you have any suggestions for training a model to predict m5C in the human genome?

rhallPB commented 3 years ago

We have been looking into this recently and have some public datasets to share:

Host: ftp2.pacificbiosciences.com
Username: baseMod
Password: m5CControlTest

The data is in the kinetic format from Sequel IIe, HiFi Kinetics.

If you would like to continue this conversation, please email support@pacb.com and ask them to forward the request to Richard Hall.
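
For anyone who wants to browse the shared data from a script, here is a minimal sketch using Python's standard `ftplib` with the connection details posted above. It assumes the server is still reachable and accepts plain FTP. The directory layout is not described in this thread, so the code only lists whatever the server reports. The `pick_by_suffix` helper and its default `.bam` suffix are hypothetical, added only to illustrate filtering the listing.

```python
from ftplib import FTP

# Connection details exactly as posted in the comment above.
HOST = "ftp2.pacificbiosciences.com"
USERNAME = "baseMod"
PASSWORD = "m5CControlTest"


def list_remote_files(host=HOST, user=USERNAME, password=PASSWORD, path="."):
    """Log in and return the names the server reports under `path`.

    Assumption: the server allows plain FTP logins and directory listings.
    """
    with FTP(host, timeout=60) as ftp:
        ftp.login(user=user, passwd=password)
        return ftp.nlst(path)


def pick_by_suffix(names, suffixes=(".bam",)):
    """Keep names ending in one of `suffixes`, ignoring case.

    Hypothetical helper: the actual file types on the server are not
    stated in this thread, so adjust `suffixes` after inspecting the listing.
    """
    wanted = tuple(s.lower() for s in suffixes)
    return [name for name in names if name.lower().endswith(wanted)]


# Example use (requires network access):
#   names = list_remote_files()
#   print(pick_by_suffix(names))
```

Listing first and downloading later keeps the sketch independent of any guess about file names or sizes on the server.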