bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

Result verification #31

ritma001 closed this issue 4 years ago

ritma001 commented 4 years ago

Hi there,

I recently used deepsignal to detect DNA methylation and obtained the result shown below.

1       387268  -       4855350 079a30f7-9f4c-4a7a-ad5b-39e3ef61530b    t       0.565757        0.43424305      0       CTGCCTGTCGTGGGCGG
1       2781788 -       2460830 0802b481-faa3-4aa6-9572-54fa77053c81    t       0.057231583     0.9427684       1       CCCTGATCCGGACGGAA
1       2780877 -       2461741 0802b481-faa3-4aa6-9572-54fa77053c81    t       0.06506838      0.9349316       1       ACGCAGCGCGAGGATCT

This is the result from E. coli genome sequencing. I used the .ckpt file from model.GATC.R9_2D.tem.puc19.bn17.sn360.tar.gz for --model_path.

From this output snippet, I understand that the last two rows show a methylated C at the 9th position of the 17-mers. I also confirmed the predicted methylation position against the genome position given in the 2nd column.

However, I observe that G_A_TC, a recognition motif for one of the DNA methyltransferases (the methylated nucleotide in this motif is flanked by _), does not overlap the predicted methylated nucleotide. Interestingly, the motif usually appears elsewhere in the 17-mers.

I also checked for other recognition motifs, i.e. A_A_CGTCG, CC[A/T]GG and ATGC_A_T, of the different DNA methyltransferases mentioned in PMC4231299. None of the expected methylated nucleotides (flanked by _) in these motifs coincide with the predicted methylation site (the 9th nucleotide of the 17-mers). However, these motifs frequently appear at various other positions in the 17-mers.

So I am unsure how reliable the prediction is and whether I am interpreting the result correctly. It is quite a detailed question, and I would be glad to receive any feedback.

Best,

Wannisa
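The observation in this comment can be reproduced with a short script. The following is only a minimal sketch in Python: it assumes the last column is the 17-mer, the second-to-last column is the called label (1 for methylated, 0 for unmethylated), and the 2nd column is the genome position. The helper name `summarize` is hypothetical and is not part of deepsignal.

```python
# Rows copied from the output snippet in this comment (whitespace-separated).
ROWS = """\
1 387268 - 4855350 079a30f7-9f4c-4a7a-ad5b-39e3ef61530b t 0.565757 0.43424305 0 CTGCCTGTCGTGGGCGG
1 2781788 - 2460830 0802b481-faa3-4aa6-9572-54fa77053c81 t 0.057231583 0.9427684 1 CCCTGATCCGGACGGAA
1 2780877 - 2461741 0802b481-faa3-4aa6-9572-54fa77053c81 t 0.06506838 0.9349316 1 ACGCAGCGCGAGGATCT
"""


def summarize(line, motif="GATC"):
    """Describe the centre of the k-mer and where `motif` occurs inside it."""
    fields = line.split()
    kmer = fields[-1]            # assumed: last column is the k-mer
    center = len(kmer) // 2      # 0-based index 8, i.e. the 9th base of a 17-mer
    return {
        "pos": int(fields[1]),   # assumed: 2nd column is the genome position
        "called": int(fields[-2]),  # assumed: 1 = methylated, 0 = unmethylated
        "center_base": kmer[center],
        "center_dinuc": kmer[center:center + 2],
        # 0-based start offsets of every motif occurrence in the k-mer
        "motif_offsets": [
            i for i in range(len(kmer) - len(motif) + 1)
            if kmer[i:i + len(motif)] == motif
        ],
    }


summaries = [summarize(line) for line in ROWS.splitlines() if line.strip()]
for s in summaries:
    print(s)
```

For the A of GATC to sit at the 9th base, GATC would have to start at 0-based offset 7 of the 17-mer. In these three rows the centre base is a C followed by a G, and GATC, where present, starts at a different offset, which matches what is described above.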

PengNi commented 4 years ago

Hi @ritma001 ,

Thanks for your interest. To call modifications with deepsignal, a motif sequence must be set. See --motifs and --mod_loc in the extract or call_mods module for more details (deepsignal.py#278).

As a model-based method, deepsignal cannot predict modifications for every existing motif. Currently we only have two trained models, one for CpG and another for GATC.

Best, Peng

ritma001 commented 4 years ago

Dear Peng,

Thank you for your quick response. I have 2 follow-up questions:

Q1: Since I use the GATC model, shouldn't I expect the prediction of a methylated "A" at the 9th position of the 17-mer stretch?

Q2: Is it necessary to call the modifications with a defined motif in the first place? Let me clarify that my initial goal is not to find a specific motif; this is just a sanity check of the output.

If the model was developed based on GATC motifs, I would expect to see the motif and its methylated "A" being predicted and shown in the middle of the 17-mer stretch (9th position).

Best,

Wannisa

PengNi commented 4 years ago

Dear Wannisa,

  1. If you use the GATC model, you should also set --motifs to GATC and --mod_loc to 1, so that only GATC k-mers are extracted from the fast5 files.

  2. Yes, it's necessary. Because deepsignal is a model-based method, it can't predict modifications for all motifs.

Best, Peng
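To illustrate how a motif and a modification offset decide which base ends up at the centre of each 17-mer, here is a minimal Python sketch. It is not deepsignal's implementation: `motif_centered_kmers` is a hypothetical helper, the sequence is made up, only the forward strand is scanned, and `mod_loc` is assumed to be a 0-based offset into the motif (consistent with 1 pointing at the A of GATC).

```python
def motif_centered_kmers(seq, motif="GATC", mod_loc=1, kmer_len=17):
    """Return (center_index, kmer) for each motif hit whose window fits in seq.

    The k-mer is centred on the base at offset `mod_loc` within the motif.
    """
    half = kmer_len // 2
    hits = []
    start = seq.find(motif)
    while start != -1:
        center = start + mod_loc
        # Keep only hits with a full window on both sides of the centre.
        if center - half >= 0 and center + half < len(seq):
            hits.append((center, seq[center - half:center + half + 1]))
        start = seq.find(motif, start + 1)
    return hits


# Toy sequence, invented for illustration only.
seq = "ACGTTGCA" * 2 + "GATC" + "TTGCAACG" * 2

gatc_hits = motif_centered_kmers(seq, motif="GATC", mod_loc=1)
cg_hits = motif_centered_kmers(seq, motif="CG", mod_loc=0)

for center, kmer in gatc_hits:
    print("GATC", center, kmer, "centre base:", kmer[len(kmer) // 2])
for center, kmer in cg_hits:
    print("CG  ", center, kmer, "centre base:", kmer[len(kmer) // 2])
```

With the GATC motif and an offset of 1, every returned 17-mer has an A at its 9th base. With a CG motif and an offset of 0, the centre is a C instead, even when the window happens to contain GATC somewhere else.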

ritma001 commented 4 years ago

Dear Peng,

Thank you for the clarification. It is clear to me now.