WGLab / DeepMod

DeepMod: a deep-learning tool for genomic-scale, strand-sensitive and single-nucleotide based detection of DNA modifications

What datasets can be used to run DeepMod #45

Open gottaMe opened 3 years ago

gottaMe commented 3 years ago

Hi Liu, I'm very interested in DeepMod and want to use it to call 5mC methylation in the datasets provided by Simpson et al., which I downloaded from https://www.ebi.ac.uk/ena/browser/view/PRJEB13021.

Since I don't have my own GPU, I am testing these data on a GPU server. I downloaded the datasets 'ecoli_er2925.pcr.timp.021216.tar.gz' (75.3 GB) and 'ecoli_er2925.pcr_MSssI.timp.021216.tar.gz' (59.5 GB). Unfortunately, these datasets are too large to upload to the GPU server, so I extracted the subset of files corresponding to ch67 and zipped them for upload.

When I use the model 'mod_train_sinmodC_P100wd21_f3ne1u0' to call 5mC methylation on the ch67 data from 'ecoli_er2925.pcr_MSssI.timp.021216.tar.gz', I get the following error:

Error!!! No Raw_reads/Signal data /Raw/Reads in data/meth10_lib3/ecoli_er2925.pcr_MSssI.timp.021216.fast5_small/fail/kelvin_021116_methecoli_4101_1_ch67_file7_strand.fast5

I then used the following command to examine one of these files:

h5ls -r kelvin_021116_methecoli_4101_1_ch67_file101_strand.fast5

Indeed, this fast5 file has no Raw_reads/Signals information. I then checked the other ch67 fast5 files in 'ecoli_er2925.pcr_MSssI.timp.021216.tar.gz', and they also lack raw read/signal information.

So I'm wondering how you trained and tested DeepMod if you used the datasets provided by Simpson et al. If those datasets are only partially usable, could you tell me which part of the data you used to train and test the model? Additionally, are there any other public datasets I can use to run DeepMod?

I'm new to using nanopore data to detect methylation, so some of my questions may seem strange, but I would appreciate any help with these problems.

Yours, Chen.

liuqianhn commented 2 years ago

@gottaMe Thanks for your interest in DeepMod. Could you please show what you get from h5ls -r YOUR-fast5 | head -n 50? Meanwhile, data/meth10_lib3/ecoli_er2925.pcr_MSssI.timp.021216.fast5_small/fail/kelvin_021116_methecoli_4101_1_ch67_file7_strand.fast5 is from a fail folder, which might not contain useful fast5 files. If you have a pass folder alongside the fail folder, please use the fast5 files from the pass folder.

gottaMe commented 2 years ago

Thanks for your reply!

I tried using the files from the pass folder to test DeepMod, but it still reports the error:

Error!!! No Raw_reads/Signal data /Raw/Reads in data/Control_lib1/ecoli_er2925.pcr.timp.021216.fast5_small/pass/imperial_021116_unmethecoli_3923_1_ch67_file60_strand.fast5

The output of h5ls -r imperial_021116_unmethecoli_3923_1_ch67_file60_strand.fast5 | head -n 50 is as follows (this fast5 file is one of the files in the pass folder):

/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/BaseCalled_complement Group
/Analyses/Basecall_1D_000/BaseCalled_complement/Events Dataset {3499}
/Analyses/Basecall_1D_000/BaseCalled_complement/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_000/BaseCalled_complement/Model Dataset {4096}
/Analyses/Basecall_1D_000/BaseCalled_template Group
/Analyses/Basecall_1D_000/BaseCalled_template/Events Dataset {3819}
/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_000/BaseCalled_template/Model Dataset {4096}
/Analyses/Basecall_1D_000/Configuration Group
/Analyses/Basecall_1D_000/Configuration/aggregator Group
/Analyses/Basecall_1D_000/Configuration/basecall_1d Group
/Analyses/Basecall_1D_000/Configuration/basecall_2d Group
/Analyses/Basecall_1D_000/Configuration/calibration_strand Group
/Analyses/Basecall_1D_000/Configuration/components Group
/Analyses/Basecall_1D_000/Configuration/general Group
/Analyses/Basecall_1D_000/Configuration/hairpin_align Group
/Analyses/Basecall_1D_000/Configuration/post_processing Group
/Analyses/Basecall_1D_000/Configuration/post_processing.3000Hz Group
/Analyses/Basecall_1D_000/Configuration/split_hairpin Group
/Analyses/Basecall_1D_000/Log Dataset {SCALAR}
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_complement Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/Analyses/Basecall_2D_000 Group
/Analyses/Basecall_2D_000/BaseCalled_2D Group
/Analyses/Basecall_2D_000/BaseCalled_2D/Alignment Dataset {4468}
/Analyses/Basecall_2D_000/BaseCalled_2D/Fastq Dataset {SCALAR}
/Analyses/Basecall_2D_000/Configuration Group
/Analyses/Basecall_2D_000/Configuration/aggregator Group
/Analyses/Basecall_2D_000/Configuration/basecall_1d Group
/Analyses/Basecall_2D_000/Configuration/basecall_2d Group
/Analyses/Basecall_2D_000/Configuration/calibration_strand Group
/Analyses/Basecall_2D_000/Configuration/components Group
/Analyses/Basecall_2D_000/Configuration/general Group
/Analyses/Basecall_2D_000/Configuration/hairpin_align Group
/Analyses/Basecall_2D_000/Configuration/post_processing Group
/Analyses/Basecall_2D_000/Configuration/post_processing.3000Hz Group
/Analyses/Basecall_2D_000/Configuration/split_hairpin Group
/Analyses/Basecall_2D_000/HairpinAlign Group
/Analyses/Basecall_2D_000/HairpinAlign/Alignment Dataset {3217}
/Analyses/Basecall_2D_000/Log Dataset {SCALAR}
/Analyses/Basecall_2D_000/Summary Group
/Analyses/Basecall_2D_000/Summary/basecall_2d Group
/Analyses/Basecall_2D_000/Summary/hairpin_align Group
/Analyses/Basecall_2D_000/Summary/post_process_complement Group
/Analyses/Basecall_2D_000/Summary/post_process_template Group
/Analyses/Calibration_Strand_000 Group

liuqianhn commented 2 years ago

@gottaMe It seems that the fast5 files contain a lot of basecalling info. Could you post the full output of h5ls -r imperial_021116_unmethecoli_3923_1_ch67_file60_strand.fast5 (without head)? Thanks.
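A long listing like this is easier to scan once it is reduced to its datasets. The following is a minimal Python sketch, not part of DeepMod, that assumes h5ls -r prints one object per line in the form "path Type {shape}". The function name list_datasets is hypothetical, and the sample lines are copied from the listing in this thread.

```python
import re

# One h5ls -r line: an absolute path, then "Group" or "Dataset",
# optionally followed by a shape in braces (e.g. {3819} or {SCALAR}).
LINE_RE = re.compile(
    r"^(?P<path>/\S*)\s+(?P<kind>Group|Dataset)(?:\s+\{(?P<shape>[^}]*)\})?"
)


def list_datasets(h5ls_output):
    """Return (path, shape) for every Dataset line in `h5ls -r` output."""
    datasets = []
    for line in h5ls_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("kind") == "Dataset":
            datasets.append((match.group("path"), match.group("shape") or ""))
    return datasets


# A few lines taken from the listing posted in this thread.
sample = """\
/                        Group
/Analyses                Group
/Analyses/Basecall_1D_000/BaseCalled_template/Events Dataset {3819}
/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/EventDetection_000/Reads/Read_58/Events Dataset {7371}
"""

for path, shape in list_datasets(sample):
    print(path, shape)
```

In practice the text would come from running h5ls -r on the file and capturing its output; only datasets such as Events and Fastq appear here, which matches the observation that the file holds basecalling results.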

gottaMe commented 2 years ago

Thanks for your reply!

The full output of the command h5ls -r imperial_021116_unmethecoli_3923_1_ch67_file60_strand.fast5 is as follows:

/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/BaseCalled_complement Group
/Analyses/Basecall_1D_000/BaseCalled_complement/Events Dataset {3499}
/Analyses/Basecall_1D_000/BaseCalled_complement/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_000/BaseCalled_complement/Model Dataset {4096}
/Analyses/Basecall_1D_000/BaseCalled_template Group
/Analyses/Basecall_1D_000/BaseCalled_template/Events Dataset {3819}
/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_000/BaseCalled_template/Model Dataset {4096}
/Analyses/Basecall_1D_000/Configuration Group
/Analyses/Basecall_1D_000/Configuration/aggregator Group
/Analyses/Basecall_1D_000/Configuration/basecall_1d Group
/Analyses/Basecall_1D_000/Configuration/basecall_2d Group
/Analyses/Basecall_1D_000/Configuration/calibration_strand Group
/Analyses/Basecall_1D_000/Configuration/components Group
/Analyses/Basecall_1D_000/Configuration/general Group
/Analyses/Basecall_1D_000/Configuration/hairpin_align Group
/Analyses/Basecall_1D_000/Configuration/post_processing Group
/Analyses/Basecall_1D_000/Configuration/post_processing.3000Hz Group
/Analyses/Basecall_1D_000/Configuration/split_hairpin Group
/Analyses/Basecall_1D_000/Log Dataset {SCALAR}
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_complement Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/Analyses/Basecall_2D_000 Group
/Analyses/Basecall_2D_000/BaseCalled_2D Group
/Analyses/Basecall_2D_000/BaseCalled_2D/Alignment Dataset {4468}
/Analyses/Basecall_2D_000/BaseCalled_2D/Fastq Dataset {SCALAR}
/Analyses/Basecall_2D_000/Configuration Group
/Analyses/Basecall_2D_000/Configuration/aggregator Group
/Analyses/Basecall_2D_000/Configuration/basecall_1d Group
/Analyses/Basecall_2D_000/Configuration/basecall_2d Group
/Analyses/Basecall_2D_000/Configuration/calibration_strand Group
/Analyses/Basecall_2D_000/Configuration/components Group
/Analyses/Basecall_2D_000/Configuration/general Group
/Analyses/Basecall_2D_000/Configuration/hairpin_align Group
/Analyses/Basecall_2D_000/Configuration/post_processing Group
/Analyses/Basecall_2D_000/Configuration/post_processing.3000Hz Group
/Analyses/Basecall_2D_000/Configuration/split_hairpin Group
/Analyses/Basecall_2D_000/HairpinAlign Group
/Analyses/Basecall_2D_000/HairpinAlign/Alignment Dataset {3217}
/Analyses/Basecall_2D_000/Log Dataset {SCALAR}
/Analyses/Basecall_2D_000/Summary Group
/Analyses/Basecall_2D_000/Summary/basecall_2d Group
/Analyses/Basecall_2D_000/Summary/hairpin_align Group
/Analyses/Basecall_2D_000/Summary/post_process_complement Group
/Analyses/Basecall_2D_000/Summary/post_process_template Group
/Analyses/Calibration_Strand_000 Group
/Analyses/Calibration_Strand_000/Configuration Group
/Analyses/Calibration_Strand_000/Configuration/aggregator Group
/Analyses/Calibration_Strand_000/Configuration/basecall_1d Group
/Analyses/Calibration_Strand_000/Configuration/basecall_2d Group
/Analyses/Calibration_Strand_000/Configuration/calibration_strand Group
/Analyses/Calibration_Strand_000/Configuration/components Group
/Analyses/Calibration_Strand_000/Configuration/general Group
/Analyses/Calibration_Strand_000/Configuration/hairpin_align Group
/Analyses/Calibration_Strand_000/Configuration/post_processing Group
/Analyses/Calibration_Strand_000/Configuration/post_processing.3000Hz Group
/Analyses/Calibration_Strand_000/Configuration/split_hairpin Group
/Analyses/Calibration_Strand_000/Log Dataset {SCALAR}
/Analyses/Calibration_Strand_000/Summary Group
/Analyses/EventDetection_000 Group
/Analyses/EventDetection_000/Configuration Group
/Analyses/EventDetection_000/Configuration/abasic_detection Group
/Analyses/EventDetection_000/Configuration/event_detection Group
/Analyses/EventDetection_000/Configuration/hairpin_detection Group
/Analyses/EventDetection_000/Reads Group
/Analyses/EventDetection_000/Reads/Read_58 Group
/Analyses/EventDetection_000/Reads/Read_58/Events Dataset {7371}
/Analyses/Hairpin_Split_000 Group
/Analyses/Hairpin_Split_000/Configuration Group
/Analyses/Hairpin_Split_000/Configuration/aggregator Group
/Analyses/Hairpin_Split_000/Configuration/basecall_1d Group
/Analyses/Hairpin_Split_000/Configuration/basecall_2d Group
/Analyses/Hairpin_Split_000/Configuration/calibration_strand Group
/Analyses/Hairpin_Split_000/Configuration/components Group
/Analyses/Hairpin_Split_000/Configuration/general Group
/Analyses/Hairpin_Split_000/Configuration/hairpin_align Group
/Analyses/Hairpin_Split_000/Configuration/post_processing Group
/Analyses/Hairpin_Split_000/Configuration/post_processing.3000Hz Group
/Analyses/Hairpin_Split_000/Configuration/split_hairpin Group
/Analyses/Hairpin_Split_000/Log Dataset {SCALAR}
/Analyses/Hairpin_Split_000/Summary Group
/Analyses/Hairpin_Split_000/Summary/split_hairpin Group
/Sequences Group
/Sequences/Meta Group
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group

liuqianhn commented 2 years ago

Hi @gottaMe, from the h5ls output it seems that there is no "Raw_reads/Signals" group holding the signals. I suspect "/Sequences/Meta" might relate to the signals, but I cannot be sure until I read the fast5 file myself. I have been trying to download the data (so far without success because of a possible firewall issue, which I will fix later), so it would be great if you could share a single fast5 file for me to check.
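The same check can be scripted over a captured h5ls -r listing. This is a rough Python sketch, not DeepMod code: it assumes raw signal in a single-read fast5 is stored as a dataset at /Raw/Reads/<read>/Signal (the error message above points at /Raw/Reads), and the function name raw_signal_paths and the with_raw example line are hypothetical.

```python
def raw_signal_paths(h5ls_output):
    """Return Dataset paths shaped like /Raw/Reads/<read>/Signal.

    Assumption: raw signal lives under /Raw/Reads/<read>/Signal, and each
    h5ls -r line is "path Type [{shape}]" separated by whitespace.
    """
    found = []
    for line in h5ls_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[1] != "Dataset":
            continue
        parts = fields[0].strip("/").split("/")
        if len(parts) == 4 and parts[:2] == ["Raw", "Reads"] and parts[3] == "Signal":
            found.append(fields[0])
    return found


# Lines taken from the listing in this thread: events only, no /Raw group.
no_raw = """\
/Analyses/EventDetection_000/Reads/Read_58 Group
/Analyses/EventDetection_000/Reads/Read_58/Events Dataset {7371}
/Sequences/Meta Group
"""

# Hypothetical line showing what a file with raw signal might list;
# the shape value is invented for illustration.
with_raw = no_raw + "/Raw/Reads/Read_58/Signal Dataset {61234}\n"

print(raw_signal_paths(no_raw))
print(raw_signal_paths(with_raw))
```

An empty result for a file is consistent with the "No Raw_reads/Signal data" error, since only event-level data is present.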

gottaMe commented 2 years ago

Thanks for your reply!

Here is the test fast5 file. It is the same file used in the command h5ls -r imperial_021116_unmethecoli_3923_1_ch67_file60_strand.fast5, and its path in the archive is \ecoli_er2925.pcr.timp.021216.fast5\pass\imperial_021116_unmethecoli_3923_1_ch67_file60_strand.fast5

imperial_021116_unmethecoli_3923_1_ch67_file60_strand.zip

liuqianhn commented 2 years ago

@gottaMe Thanks for sharing this file. I downloaded it and checked it carefully; unfortunately, I do NOT find any raw signal info in the file. I have no clue why, since fast5 files generated by a Nanopore sequencer usually contain raw signal data.

gottaMe commented 2 years ago

OK, thank you for your help!