how to train a model - Githubissues

ShangjinTan commented 5 years ago

Hi Deepsignal,

I am impressed by the high sensitivity and accuracy of deepsignal in calling methylation sites. I would very much like to try it in my study. Here I have a few questions.

1. Deepsignal only provides a human CpG model. What I want is to extract all methylation motifs (not only CpG) of all methylation types (6mA, 5mC, 4mC) from microorganisms. So it seems I have to train a custom model. Am I right?
2. deepsignal extract can extract features for training. Could you please explain a little about what exactly is extracted?
3. I have tried deepsignal extract on the example yeast data. The methy_label of every position is '1'. Does '1' mean that this position will be used for training? What does '1' mean?
4. If the result of deepsignal extract is used for training a model, how does deepsignal know which base is methylated?
5. deepsignal extracts selected motifs with the same mod_loc. I want to extract all types of motifs (probably with different mod_loc), including novel motifs. Does this mean that deepsignal extract is not applicable to me?
6. For training a model, if the input is a pool of all methylation types, is there a requirement for the number of samples of each type, or of a specific motif of a type?
7. Could you please give some advice on how to prepare the files for training a model?

Thank you so much. Shangjin

PengNi commented 5 years ago

Hi @ShangjinTan ,

Thanks for your interest.

1. Currently, different motifs/methylation types need different deepsignal models. A custom model is needed for a non-CpG methylation type.
2. The extraction module extracts five kinds of features (one for the CNN and four for the RNN) for deepsignal. One line represents one sample for training/testing. The details of the output format are in the README. More details are in the preprint manuscript.
3. The methy_label has two possible values, [0, 1]: 0 represents unmethylated, 1 represents methylated.
4. Same as 3.
5. Any motif sequence that follows the IUPAC alphabet can be trained (check the --motifs and --mod_loc options). However, deepsignal cannot guarantee high performance for multiple methylation types (or even multiple motifs) with a single model. So far we have tested models for CpG, GATC (6mA), and CCWGG (5mC) separately.
6. Same as 5.
7. To train a model, methylated and unmethylated samples from reads are necessary. The samples can be chosen either from methylase-treated/PCR-amplified data, or based on bisulfite sequencing or another sequencing technique.
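To illustrate what a motif written in the IUPAC alphabet plus a mod_loc selects (point 5 above), here is a minimal sketch. It is not deepsignal's own code: the function names are hypothetical, and it assumes that mod_loc is the 0-based offset of the targeted base within the motif.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif such as 'CCWGG' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_mod_sites(seq, motif, mod_loc):
    """Return 0-based positions in seq of the targeted base of each motif hit.

    mod_loc is assumed to be the 0-based offset of the targeted base
    inside the motif. A lookahead is used so overlapping hits are found.
    """
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [m.start() + mod_loc for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # CCWGG matches both CCAGG and CCTGG; the second C is targeted here.
    print(find_mod_sites("ACCAGGTCCTGGA", "CCWGG", 1))
    # GATC with the A (offset 1) as the targeted base.
    print(find_mod_sites("GATCGATC", "GATC", 1))
```

Motifs with different targeted offsets (for example CG with offset 0 and GATC with offset 1) need separate calls here, which mirrors the point that each motif/methylation type is handled separately.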

The chosen samples can then be shuffled and split into training and validating datasets. According to our experiments, a model can be trained to achieve high performance with at most 20 million samples for training and at least 10k samples for validating (half positive samples, half negative samples).
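The shuffle-and-split step can be sketched as follows. This is only an illustration under stated assumptions, not one of the scripts shipped with deepsignal: the function name is hypothetical, and the samples are assumed to be already separated into a methylated list and an unmethylated list (one extracted feature line per sample).

```python
import random


def balance_shuffle_split(pos, neg, n_valid, seed=0):
    """Build balanced, shuffled training and validating sets.

    pos / neg: lists of methylated / unmethylated samples.
    n_valid:   total number of validating samples (half from each class).
    """
    rng = random.Random(seed)
    n = min(len(pos), len(neg))   # keep the two classes the same size
    half_valid = n_valid // 2
    if half_valid > n:
        raise ValueError("not enough samples for the requested validating set")

    # Randomly downsample each class to n samples (also shuffles them).
    pos = rng.sample(pos, n)
    neg = rng.sample(neg, n)

    valid = pos[:half_valid] + neg[:half_valid]
    train = pos[half_valid:] + neg[half_valid:]
    rng.shuffle(valid)
    rng.shuffle(train)
    return train, valid


if __name__ == "__main__":
    pos = ["p%d\t1" % i for i in range(30)]
    neg = ["n%d\t0" % i for i in range(50)]
    train, valid = balance_shuffle_split(pos, neg, n_valid=10)
    print(len(train), len(valid))
```

Downsampling the larger class is one simple way to get the half-positive, half-negative mix; for very large feature files, the same idea would need to be done in a streaming fashion rather than in memory.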

Some scripts from /scripts may be useful. Feel free to email nipeng at csu.edu.cn for more details or scripts.

Best, Peng

ardakdemir commented 4 years ago

I am interested in training my own model. Would it be possible for you to share with the community the datasets you have used for training? Or any reference to a database containing a dataset that can be used for training deepsignal (raw nanopore signals and methylation labels for each read) would also be very much appreciated!

Thanks in advance!

PengNi commented 4 years ago

Hi @ardakdemir ,

First, you can check out nanopolish. The data (PRJEB13021) contains R9 reads of E. coli and human NA12878. The reads are either totally methylated or totally unmethylated for 5mC.
The dataset from signalAlign contains reads for 6mA in GATC.
We also used 30x R9.4 reads of human NA12878 (PRJEB23027) from this work (nbt.4060). We got the high-confidence 5mC positions from the bisulfite sequencing data (ENCFF835NTC).
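Selecting high-confidence positions from bisulfite data could look roughly like the sketch below. Everything specific in it is an assumption rather than the procedure actually used: the function name is hypothetical, the input is assumed to follow the ENCODE bedMethyl layout (column 10 = read coverage, column 11 = percentage of reads showing methylation), and the thresholds (coverage of at least 5, 100% for methylated, 0% for unmethylated) are placeholders to be tuned.

```python
def high_confidence_sites(bed_lines, min_cov=5, hi_pct=100.0, lo_pct=0.0):
    """Map (chrom, start, strand) -> label for confidently called sites.

    Label 1 = methylated, 0 = unmethylated. Sites with low coverage or an
    intermediate methylation percentage are dropped.
    Assumes bedMethyl-style columns; thresholds are illustrative only.
    """
    sites = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        coverage, percent = int(fields[9]), float(fields[10])
        if coverage < min_cov:
            continue
        if percent >= hi_pct:
            sites[(chrom, start, strand)] = 1
        elif percent <= lo_pct:
            sites[(chrom, start, strand)] = 0
    return sites


if __name__ == "__main__":
    example = [
        "chr1\t100\t101\t.\t10\t+\t100\t101\t0,255,0\t10\t100",
        "chr1\t200\t201\t.\t3\t+\t200\t201\t0,255,0\t3\t100",
        "chr1\t300\t301\t.\t8\t-\t300\t301\t255,0,0\t8\t0",
    ]
    print(high_confidence_sites(example))
```

The resulting dictionary can then be used to assign a methy_label to each extracted sample by its genomic position.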

Best, Peng

ardakdemir commented 4 years ago

Thanks a lot for the suggestions!

Best

Arda

ardakdemir commented 4 years ago

Dear @PengNi

"First you can check out nanopolish. The data (PRJEB13021) contains R9 reads of E.coli and Human NA12878. The reads are either totally methylated or totally unmethylated for 5mC."

The dataset you mentioned above contains many files. Which ones did you use for training? And how should I infer whether the reads are methylated or unmethylated? Is the information contained inside the fast5 files?

PengNi commented 4 years ago

Hi @ardakdemir ,

We used E. coli R9 reads for training and testing. You can recognize the type of each file by its filename: a filename containing "pcr" means the reads are unmethylated, and "pcr_MSssI" means the reads are methylated. You can read their paper to double-check.
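The filename rule can be written as a small helper. This is a sketch with a hypothetical function name and hypothetical example filenames; the only thing taken from the dataset description is that "pcr_MSssI" marks methylated reads and plain "pcr" marks unmethylated reads. Because "pcr_MSssI" also contains "pcr", the more specific tag has to be checked first.

```python
import os


def label_from_filename(path):
    """Return 1 (methylated), 0 (unmethylated), or None if undetermined."""
    name = os.path.basename(path).lower()
    if "pcr_msssi" in name:   # check the longer tag before plain "pcr"
        return 1
    if "pcr" in name:
        return 0
    return None


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for fn in ["ecoli_pcr_MSssI_run1.tar", "ecoli_pcr_run1.tar", "other.tar"]:
        print(fn, label_from_filename(fn))
```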

Best, Peng

ardakdemir commented 4 years ago

Thanks a lot!

ardakdemir commented 4 years ago

How can I obtain the same reference you used for mapping the fast5 files for E. coli K12 ER2925?

I could not find any reference for ER2925.

PengNi commented 4 years ago

I used this reference: ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29.dna.genome.fa.gz

ardakdemir commented 4 years ago

Thanks! I also downloaded that, but tombo gives this error:

Poor raw to expected signal matching (revert with tombo filter clear_filters)

Did you experience anything similar?

PengNi commented 4 years ago

tombo only supports R9.4+ reads. If you want to process the E. coli R9 2D reads, you can use nanoraw (https://github.com/marcus1487/nanoraw).

Also, I suggest you use R9.4 reads (maybe human NA12878 (PRJEB23027)) for experiments too. Nanopore may no longer use the R9 2D flowcell.

ardakdemir commented 4 years ago

Thanks a lot for the information. I wonder how using the raw basecalls would affect the final performance at the read level?

Do you think we can skip the resquiggle step and do the methylation calling directly from nanopore basecalls? We may not always have a reference for the resquiggle step.

PengNi commented 4 years ago

Emm, in my opinion, it makes no sense to call methylation without a reference. We always need to align reads to a genome to do some analysis.

bioinfomaticsCSU / deepsignal

how to train a model #7