bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0
108 stars, 21 forks

Nanopore data for NA12878 #44

Closed. PanZiwei closed this issue 4 years ago

PanZiwei commented 4 years ago

Hi DeepSignal,

I really like your model design for Nanopore methylation calling. I have a few questions and would really appreciate your help:

  1. Did you use only the original .fast5 files from the NA12878 dataset (https://www.ebi.ac.uk/ena/data/view/PRJEB13021) for the paper and then basecall them yourselves? Have you ever considered the rel6 genomic DNA data?

  2. Is there a specific reason for using Albacore? I think Nanopore deprecated Albacore after R9.4, and Guppy is now the main basecaller.

  3. 4mC and 5mC can exist at the same time in bacteria and cannot be distinguished by oxBS-seq because of how the method works. How does the model deal with this issue?

  4. You use high-confidence 5mC sites from oxBS-seq to label your Nanopore reads for training. However, sites that are 100% 5mC should be very rare in NA12878; in most cases a CpG site is a mixture of 5mC and unmethylated C at the genome level. How do you handle this?

  5. Can you provide more details on how accuracy is calculated at the read level? Do you calculate the percentage of correct methylation calls in a read? What about other evaluation metrics such as sensitivity, specificity, and AUC at the read level?

  6. Since Nanopore has now released R10.3, is your method compatible with the latest pore version?

Thank you so much for your help!

Best, Ziwei

PengNi commented 4 years ago

Hi Ziwei,

Thanks for your interest.

1&2, We used Albacore to basecall all datasets in the paper, including NA12878. There is no specific reason for using Albacore, and I believe Guppy is a better option. The performance of our trained models on Guppy-called reads hasn't been tested rigorously. However, in a small test we saw no significant differences between the results from Albacore-called and Guppy-called reads.

3, We didn't use bacterial BS data in the paper, and we didn't consider this issue in the human tests.

4, In NA12878, we selected 5408142 cytosines that are 100% methylated. I believe that is enough to train a model. If there are not enough samples for training, I suggest lowering the threshold, e.g. to 90% methylated.
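As a rough illustration of the threshold idea in point 4, here is a minimal sketch of picking training sites by bisulfite methylation fraction. It is not DeepSignal's actual selection code: the `SiteCall` record, the `select_training_sites` function, the `min_coverage` filter, and the cutoff for unmethylated sites are all hypothetical and only assumed here for the example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SiteCall(NamedTuple):
    """Hypothetical per-site summary from bisulfite sequencing."""
    chrom: str
    pos: int
    coverage: int      # number of bisulfite reads covering the site
    meth_frac: float   # fraction of those reads called methylated (0.0 to 1.0)


def select_training_sites(
    sites: Iterable[SiteCall],
    min_coverage: int = 5,     # assumed coverage filter, not from the thread
    meth_min: float = 1.0,     # 1.0 = "100% methylated"; lower to e.g. 0.9
    unmeth_max: float = 0.0,   # assumed cutoff for negative (unmethylated) sites
) -> Tuple[List[SiteCall], List[SiteCall]]:
    """Split sites into high-confidence methylated and unmethylated sets.

    Sites with intermediate methylation or low coverage are dropped.
    """
    methylated: List[SiteCall] = []
    unmethylated: List[SiteCall] = []
    for site in sites:
        if site.coverage < min_coverage:
            continue
        if site.meth_frac >= meth_min:
            methylated.append(site)
        elif site.meth_frac <= unmeth_max:
            unmethylated.append(site)
    return methylated, unmethylated


# Made-up example sites, for illustration only.
example_sites = [
    SiteCall("chr1", 100, 10, 1.0),
    SiteCall("chr1", 200, 12, 0.92),
    SiteCall("chr1", 300, 8, 0.5),
    SiteCall("chr1", 400, 9, 0.0),
    SiteCall("chr1", 500, 2, 1.0),   # dropped: coverage below min_coverage
]

strict_pos, strict_neg = select_training_sites(example_sites)
relaxed_pos, relaxed_neg = select_training_sites(
    example_sites, meth_min=0.9, unmeth_max=0.1
)
```

Relaxing `meth_min` from 1.0 to 0.9 admits more positive sites at the cost of some label noise, since a small share of reads at those sites are expected to be unmethylated.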

5, We use evaluate_mods_call.py to calculate accuracy and other metrics.

6, We haven't tested reads generated from the new pore. Compatibility would need to be tested and is not guaranteed at this point.
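For the read-level metrics asked about in point 5, the sketch below shows one common convention: treat every (read, site) call as one sample with a true label and a predicted methylation probability, then pool all samples. This is an assumption for illustration and not necessarily how evaluate_mods_call.py computes its numbers; the function names `read_level_metrics` and `read_level_auc` and the 0.5 threshold are hypothetical.

```python
from typing import Dict, Sequence


def read_level_metrics(
    labels: Sequence[int], probs: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity over pooled (read, site) calls.

    labels: 1 for methylated, 0 for unmethylated (ground truth).
    probs:  predicted probability that the call is methylated.
    """
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def read_level_auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half. This pairwise form is O(n_pos * n_neg), which is fine
    for a sketch but too slow for millions of calls.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative call")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up calls, for illustration only.
example_labels = [1, 1, 1, 0, 0, 0]
example_probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
metrics = read_level_metrics(example_labels, example_probs)
auc = read_level_auc(example_labels, example_probs)
```

Pooling all calls weights reads by the number of sites they cover; averaging per-read accuracy instead would weight every read equally, so the two conventions can give different numbers.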

Best, Peng

PanZiwei commented 4 years ago

Hi Peng,

Thank you so much for your help! The information is really helpful. I will close the issue since all my questions have been answered.

Best, Ziwei