WGLab / DeepMod

DeepMod: a deep-learning tool for genomic-scale, strand-sensitive and single-nucleotide based detection of DNA modifications

No Fastq data in fast5 #19

Closed qiuyixmm closed 4 years ago

qiuyixmm commented 4 years ago

@liuqianhn Hello, I downloaded test data from http://s3.climb.ac.uk/nanopolish_tutorial/methylation_example.tar.gz, a subset of the NA12878 WGS Consortium data used in the nanopolish methylation-calling tutorial. The command line I ran is:

python DeepMod.py detect \
    --wrkBase ~/deepmod_test/fast5_files \
    --Ref ~/deepmod_test/reference/reference.fasta \
    --FileID test \
    --modfile ~/DeepMod/train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0 \
    --threads 5 --outFolder myoutput/

Note: the directory ~/deepmod_test/fast5_files contains the signal-level FAST5 files unpacked from the downloaded data package.

The error information:

Nanopore sequencing data analysis is resourece-intensive and time consuming.
Some potential strong recommendations are below:
If your reference genome is large as human genome and your Nanopore data is huge,
It would be faster to run this program parallelly to speed up.
You might run different input folders of your fast5 files and give different output names (--FileID) or folders (--outFolder)
A good way for this is to run different chromosome individually.

         Current directory: ~/software_test/deepmod_test
                  outLevel: 2
                   wrkBase: ~/deepmod_test/fast5_files
                    FileID: test
                 outFolder: myoutput/
                 recursive: 1
          files_per_thread: 1000
                   threads: 5
                windowsize: 21
                  alignStr: minimap2
               basecall_1d: Basecall_1D_000
          basecall_2strand: BaseCalled_template
                    ConUnk: True
               outputlayer: 
                      Base: C
               mod_cluster: 0
                   predDet: 1
                       Ref: ~/deepmod_test/reference/reference.fasta
                      fnum: 7
                    hidden: 100
                   modfile: ~/DeepMod/train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0
                    region: [[None, None, None]]

Total files=19275
Error!!! No Fastq data in ~/deepmod_test/fast5_files/nanopore2_20161128_FNFAB49712_MN17633_sequencing_run_20161128_Human_Qiagen_1D_R9_4_64849_ch388_read4650_strand.fast5
... ...

Besides this, I also ran DeepMod on my own data, and the same "No Fastq data in *.fast5" errors were produced.

Could you please help me find a solution? Thanks!

liuqianhn commented 4 years ago

Hi @qiuyixmm, could you please show the output of h5ls -r ~/deepmod_test/fast5_files/nanopore2_20161128_FNFAB49712_MN17633_sequencing_run_20161128_Human_Qiagen_1D_R9_4_64849_ch388_read4650_strand.fast5?

qiuyixmm commented 4 years ago

This is the content for the test data:

h5ls -r nanopore2_20161128_FNFAB49712_MN17633_sequencing_run_20161128_Human_Qiagen_1D_R9_4_64849_ch388_read4650_strand.fast5
/ Group
/Analyses Group
/Analyses/Segment_Linear_000 Group
/Analyses/Segment_Linear_000/Summary Group
/Analyses/Segment_Linear_000/Summary/split_hairpin Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_4650 Group
/Raw/Reads/Read_4650/Signal Dataset {157207/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group

This is the content for my own data:

h5ls -r GXB01143_20180313_FAH59244_GA10000_sequencing_run_20180313_NPL0039_E1_81227_read_18378_ch_446_strand.fast5
/ Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_18378 Group
/Raw/Reads/Read_18378/Signal Dataset {136801/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group
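Editorial note: a listing like the ones above can be checked mechanically for the datasets the DeepMod errors refer to. The following is a minimal Python sketch, assuming the `h5ls -r` output has been captured as text and that a usable basecall group is one holding both `BaseCalled_template/Fastq` and `BaseCalled_template/Events`. The function names `basecall_groups` and `usable_group` are hypothetical and not part of DeepMod.

```python
import re

# Finds the Fastq/Events datasets of any Basecall_1D_NNN group in `h5ls -r` text.
# Works whether the listing is one entry per line or flattened onto one line.
_DATASET = re.compile(
    r"/Analyses/(Basecall_1D_\d+)/BaseCalled_template/(Fastq|Events)\b"
)


def basecall_groups(h5ls_text):
    """Map each basecall group name to the set of datasets found under it."""
    groups = {}
    for group, dataset in _DATASET.findall(h5ls_text):
        groups.setdefault(group, set()).add(dataset)
    return groups


def usable_group(h5ls_text):
    """Return the first group (by name) with both Fastq and Events, else None.

    Assumption: both datasets are needed, based on the "No Fastq data" and
    "No events data" messages quoted in this thread.
    """
    groups = basecall_groups(h5ls_text)
    for name in sorted(groups):
        if groups[name] == {"Fastq", "Events"}:
            return name
    return None


# A raw-signal-only listing, shaped like the "own data" example above.
raw_only = """/ Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_18378 Group
/Raw/Reads/Read_18378/Signal Dataset {136801/Inf}
"""
print(usable_group(raw_only))
```

For a raw-signal-only file such as the two listed above, no basecall group is found, which is consistent with the "No Fastq data" message.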

qiuyixmm commented 4 years ago

Additionally, I also downloaded a subset of the NA12878 Nanopore sequencing data (http://s3.amazonaws.com/nanopore-human-wgs/rel3-fast5-chr20.part05.tar) used in Example 3: Detect 5mC on Na12878. DeepMod ran successfully on it, and I got the BED-format results.

This is the content of one FAST5 (signal-level) file:

h5ls -r PLSP61583_20161129_FNFAB49914_MN17048_sequencing_run_Hu_Nott_Bi_FC4_tune_92763_ch488_read194_strand.fast5
/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/BaseCalled_template Group
/Analyses/Basecall_1D_000/BaseCalled_template/Events Dataset {28908}
/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_000/Configuration Group
/Analyses/Basecall_1D_000/Configuration/aggregator Group
/Analyses/Basecall_1D_000/Configuration/basecall_1d Group
/Analyses/Basecall_1D_000/Configuration/calibration_strand Group
/Analyses/Basecall_1D_000/Configuration/components Group
/Analyses/Basecall_1D_000/Configuration/event_detection Group
/Analyses/Basecall_1D_000/Configuration/general Group
/Analyses/Basecall_1D_000/Configuration/genome_mapping Group
/Analyses/Basecall_1D_000/Configuration/split_hairpin Group
/Analyses/Basecall_1D_000/Log Dataset {SCALAR}
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/Analyses/Calibration_Strand_000 Group
/Analyses/Calibration_Strand_000/Configuration Group
/Analyses/Calibration_Strand_000/Configuration/aggregator Group
/Analyses/Calibration_Strand_000/Configuration/basecall_1d Group
/Analyses/Calibration_Strand_000/Configuration/basecall_2d Group
/Analyses/Calibration_Strand_000/Configuration/calibration_strand Group
/Analyses/Calibration_Strand_000/Configuration/components Group
/Analyses/Calibration_Strand_000/Configuration/general Group
/Analyses/Calibration_Strand_000/Configuration/genome_mapping Group
/Analyses/Calibration_Strand_000/Configuration/hairpin_align Group
/Analyses/Calibration_Strand_000/Configuration/post_processing.3000Hz Group
/Analyses/Calibration_Strand_000/Configuration/split_hairpin Group
/Analyses/Calibration_Strand_000/Log Dataset {SCALAR}
/Analyses/Calibration_Strand_000/Summary Group
/Analyses/EventDetection_000 Group
/Analyses/EventDetection_000/Configuration Group
/Analyses/EventDetection_000/Configuration/aggregator Group
/Analyses/EventDetection_000/Configuration/basecall_1d Group
/Analyses/EventDetection_000/Configuration/calibration_strand Group
/Analyses/EventDetection_000/Configuration/components Group
/Analyses/EventDetection_000/Configuration/event_detection Group
/Analyses/EventDetection_000/Configuration/general Group
/Analyses/EventDetection_000/Configuration/split_hairpin Group
/Analyses/EventDetection_000/Log Dataset {SCALAR}
/Analyses/EventDetection_000/Reads Group
/Analyses/EventDetection_000/Reads/Read_194 Group
/Analyses/EventDetection_000/Reads/Read_194/Events Dataset {29484}
/Analyses/EventDetection_000/Summary Group
/Analyses/EventDetection_000/Summary/event_detection Group
/Analyses/Segment_Linear_000 Group
/Analyses/Segment_Linear_000/Configuration Group
/Analyses/Segment_Linear_000/Configuration/aggregator Group
/Analyses/Segment_Linear_000/Configuration/basecall_1d Group
/Analyses/Segment_Linear_000/Configuration/calibration_strand Group
/Analyses/Segment_Linear_000/Configuration/components Group
/Analyses/Segment_Linear_000/Configuration/general Group
/Analyses/Segment_Linear_000/Configuration/split_hairpin Group
/Analyses/Segment_Linear_000/Log Dataset {SCALAR}
/Analyses/Segment_Linear_000/Summary Group
/Analyses/Segment_Linear_000/Summary/split_hairpin Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_194 Group
/Raw/Reads/Read_194/Signal Dataset {328148/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group

It seems that these fast5 files contain more content than my own data and the first test data, but I don't know whether the completeness of the fast5 files is the cause. Although the run finished successfully, some of the same errors as before were still reported:

Nanopore sequencing data analysis is resourece-intensive and time consuming.
Some potential strong recommendations are below:
If your reference genome is large as human genome and your Nanopore data is huge,
It would be faster to run this program parallelly to speed up.
You might run different input folders of your fast5 files and give different output names (--FileID) or folders (--outFolder)
A good way for this is to run different chromosome individually.

         Current directory: ~/deepmod_test
                  outLevel: 2
                   wrkBase: ~/deepmod/fast5_file
                    FileID: test
                 outFolder: myoutput/
                 recursive: 1
          files_per_thread: 1000
                   threads: 5
                windowsize: 21
                  alignStr: minimap2
               basecall_1d: Basecall_1D_000
          basecall_2strand: BaseCalled_template
                    ConUnk: True
               outputlayer: 
                      Base: C
               mod_cluster: 0
                   predDet: 1
                       Ref: ~/deepmod/reference/human_refernce_genome.fa
                      fnum: 7
                    hidden: 100
                   modfile: ~/DeepMod/train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0
                    region: [[None, None, None]]

Total files=2772
Error!!! No Fastq data in ~/fast5_file/MinION2_20161027_FNFAB42476_MN20093_sequencing_run_Chip102_Genomic_R9_4_450bps_40738_ch178_read503_strand.fast5
Error!!! No Fastq data in ~/fast5_file/MinION2_20161020_FNFAB42473_MN20093_sequencing_run_Chip101_Genomic_R9_4_450bps_tune_74642_ch375_read753_strand1.fast5
Error!!! No events data in ~/fast5_file/PLSP61583_20161021_FNFAB42561_MN17048_sequencing_run_94_II_Hum_2_24_tune_75076_ch124_read559_strand.fast5
... ...

Besides the same errors, there are some other messages like these:

Cur Prediction consuming time 1102 for 0 2
Cur Prediction consuming time 2031 for 0 0
Cur Prediction consuming time 2140 for 0 1
Error information for different fast5 files:
No events data 8
No Fastq data 21
Not in alignment sam 685
Per-read Prediction consuming time 2149
Find: myoutput//test 25 rnn.pred.ind
['myoutput//test/rnn.pred.ind.chr9', 'myoutput//test/rnn.pred.ind.chr14', 'myoutput//test/rnn.pred.ind.chr17', 'myoutput//test/rnn.pred.ind.chr20', 'myoutput//test/rnn.pred.ind.chr3', 'myoutput//test/rnn.pred.ind.chr2', 'myoutput//test/rnn.pred.ind.chr15', 'myoutput//test/rnn.pred.ind.chr10', 'myoutput//test/rnn.pred.ind.chr7', 'myoutput//test/rnn.pred.ind.chr5', 'myoutput//test/rnn.pred.ind.chrY', 'myoutput//test/rnn.pred.ind.chr16', 'myoutput//test/rnn.pred.ind.chr6', 'myoutput//test/rnn.pred.ind.chr13', 'myoutput//test/rnn.pred.ind.chr22', 'myoutput//test/rnn.pred.ind.chr8', 'myoutput//test/rnn.pred.ind.chrX', 'myoutput//test/rnn.pred.ind.chr19', 'myoutput//test/rnn.pred.ind.chr21', 'myoutput//test/rnn.pred.ind.chr1', 'myoutput//test/rnn.pred.ind.chr18', 'myoutput//test/rnn.pred.ind.chr11', 'myoutput//test/rnn.pred.ind.chr4', 'myoutput//test/rnn.pred.ind.chrM', 'myoutput//test/rnn.pred.ind.chr12']
====sum done! To save Save myoutput/test/mod_pos.chr9-.C.bed
====sum done! To save Save myoutput/test/mod_pos.chr2-.C.bed
====sum done! To save Save myoutput/test/mod_pos.chr10-.C.bed
====sum done! To save Save myoutput/test/mod_pos.chr5-.C.bed
====sum done! To save Save myoutput/test/mod_pos.chr6-.C.bed
====sum done! To save Save myoutput/test/mod_pos.chrX+.C.bed
====sum done! To save Save myoutput/test/mod_pos.chrX-.C.bed
====sum done! To save Save myoutput/test/mod_pos.chr19-.C.bed
====sum done! To save Save myoutput/test/mod_pos.chr1-.C.bed
... ...
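Editorial note: the per-category counts in this log can be set against the "Total files=2772" line to see how many files were skipped and why. A minimal Python sketch follows, assuming the DeepMod output has been captured as a string and that each fast5 file is counted under at most one label. The function name `summarize_errors` and the shortened sample log are hypothetical.

```python
import re

# Labels copied from the "Error information for different fast5 files" summary.
LABELS = ("No events data", "No Fastq data", "Not in alignment sam")


def summarize_errors(log_text):
    """Return (total_files, {label: count}) parsed from captured DeepMod output."""
    total = int(re.search(r"Total files=(\d+)", log_text).group(1))
    counts = {}
    for label in LABELS:
        # Require a number right after the label, so per-file lines such as
        # "Error!!! No Fastq data in <path>" are not mistaken for the summary.
        match = re.search(re.escape(label) + r"\s+(\d+)", log_text)
        if match:
            counts[label] = int(match.group(1))
    return total, counts


# Shortened stand-in for the log quoted above (the file name is a placeholder).
log = (
    "Total files=2772\n"
    "Error!!! No Fastq data in example.fast5\n"
    "Error information for different fast5 files:\n"
    "No events data 8\n"
    "No Fastq data 21\n"
    "Not in alignment sam 685\n"
)
total, counts = summarize_errors(log)
skipped = sum(counts.values())
print(total, skipped, total - skipped)
```

With the numbers quoted in this thread, 714 of the 2772 files fall under one of the three labels, and most of those (685) are reads missing from the alignment rather than basecall problems.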

This is all the information I can provide. I hope it is useful.

liuqianhn commented 4 years ago

The error occurs because the fast5 files have not been basecalled, so they contain no FASTQ or event information. Re-basecalling with Albacore should resolve it.

qiuyixmm commented 4 years ago

> The error is because fast5 is not basecalled with fq and event info. Rebasecall with albacore can solve the error.

If so, why were the same errors reported for a few fast5 files in the NA12878 Nanopore sequencing data used in Example 3: Detect 5mC on Na12878? This is the content of one of the failing fast5 files:

h5ls -r /GS01/project/pengms_group/pengms20t1/dir.xumm/sv_test/fast5_file/MinION2_20161027_FNFAB42476_MN20093_sequencing_run_Chip102_Genomic_R9_4_450bps_40738_ch178_read503_strand.fast5
/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/Analyses/Basecall_1D_001 Group
/Analyses/Basecall_1D_001/BaseCalled_template Group
/Analyses/Basecall_1D_001/BaseCalled_template/Events Dataset {32861}
/Analyses/Basecall_1D_001/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_001/Configuration Group
/Analyses/Basecall_1D_001/Configuration/aggregator Group
/Analyses/Basecall_1D_001/Configuration/basecall_1d Group
/Analyses/Basecall_1D_001/Configuration/calibration_strand Group
/Analyses/Basecall_1D_001/Configuration/components Group
/Analyses/Basecall_1D_001/Configuration/event_detection Group
/Analyses/Basecall_1D_001/Configuration/general Group
/Analyses/Basecall_1D_001/Configuration/genome_mapping Group
/Analyses/Basecall_1D_001/Configuration/split_hairpin Group
/Analyses/Basecall_1D_001/Log Dataset {SCALAR}
/Analyses/Basecall_1D_001/Summary Group
/Analyses/Basecall_1D_001/Summary/basecall_1d_template Group
/Analyses/Calibration_Strand_000 Group
/Analyses/Calibration_Strand_000/Configuration Group
/Analyses/Calibration_Strand_000/Configuration/aggregator Group
/Analyses/Calibration_Strand_000/Configuration/basecall_1d Group
/Analyses/Calibration_Strand_000/Configuration/basecall_2d Group
/Analyses/Calibration_Strand_000/Configuration/calibration_strand Group
/Analyses/Calibration_Strand_000/Configuration/components Group
/Analyses/Calibration_Strand_000/Configuration/general Group
/Analyses/Calibration_Strand_000/Configuration/genome_mapping Group
/Analyses/Calibration_Strand_000/Configuration/hairpin_align Group
/Analyses/Calibration_Strand_000/Configuration/post_processing.3000Hz Group
/Analyses/Calibration_Strand_000/Configuration/split_hairpin Group
/Analyses/Calibration_Strand_000/Log Dataset {SCALAR}
/Analyses/Calibration_Strand_000/Summary Group
/Analyses/EventDetection_000 Group
/Analyses/EventDetection_000/Configuration Group
/Analyses/EventDetection_000/Configuration/aggregator Group
/Analyses/EventDetection_000/Configuration/basecall_1d Group
/Analyses/EventDetection_000/Configuration/calibration_strand Group
/Analyses/EventDetection_000/Configuration/components Group
/Analyses/EventDetection_000/Configuration/event_detection Group
/Analyses/EventDetection_000/Configuration/general Group
/Analyses/EventDetection_000/Configuration/split_hairpin Group
/Analyses/EventDetection_000/Log Dataset {SCALAR}
/Analyses/EventDetection_000/Reads Group
/Analyses/EventDetection_000/Reads/Read_503 Group
/Analyses/EventDetection_000/Reads/Read_503/Events Dataset {33632}
/Analyses/EventDetection_000/Summary Group
/Analyses/EventDetection_000/Summary/event_detection Group
/Analyses/Segment_Linear_000 Group
/Analyses/Segment_Linear_000/Configuration Group
/Analyses/Segment_Linear_000/Configuration/aggregator Group
/Analyses/Segment_Linear_000/Configuration/basecall_1d Group
/Analyses/Segment_Linear_000/Configuration/calibration_strand Group
/Analyses/Segment_Linear_000/Configuration/components Group
/Analyses/Segment_Linear_000/Configuration/general Group
/Analyses/Segment_Linear_000/Configuration/split_hairpin Group
/Analyses/Segment_Linear_000/Log Dataset {SCALAR}
/Analyses/Segment_Linear_000/Summary Group
/Analyses/Segment_Linear_000/Summary/split_hairpin Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_503 Group
/Raw/Reads/Read_503/Signal Dataset {167297/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group

liuqianhn commented 4 years ago

Hi @qiuyixmm, the errors in the first two datasets occur because the fast5 files are not basecalled and thus contain no FASTQ information. The error in the NA12878 dataset occurs because the default basecall group, Basecall_1D_000, is incorrect in those files (in the listing above it holds no Fastq dataset), while the correct basecall is under Basecall_1D_001. One solution is to remove the basecall from the failing fast5 files and re-basecall them; another is to set --basecall_1d Basecall_1D_001 ONLY for those failing fast5 files.
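Editorial note: the second option amounts to running DeepMod separately for files whose usable basecall sits in a different group. The Python sketch below shows one way to plan such runs, using only the command-line flags that appear in this thread. The function names `plan_runs` and `detect_command`, the per-group folder layout, and the sample file names are all hypothetical; how the usable group of each file is determined is left outside the sketch.

```python
from collections import defaultdict


def plan_runs(usable_group_by_file):
    """Group fast5 paths by the basecall group DeepMod should read for them.

    `usable_group_by_file` maps a fast5 path to a group name such as
    "Basecall_1D_001", or to None when the file has no usable basecall.
    """
    runs = defaultdict(list)
    for path in sorted(usable_group_by_file):
        runs[usable_group_by_file[path]].append(path)
    return dict(runs)


def detect_command(wrk_base, ref, modfile, file_id, out_folder, basecall_1d, threads=5):
    """Build the argument list for one `DeepMod.py detect` run."""
    return [
        "python", "DeepMod.py", "detect",
        "--wrkBase", wrk_base,
        "--Ref", ref,
        "--FileID", file_id,
        "--modfile", modfile,
        "--threads", str(threads),
        "--outFolder", out_folder,
        "--basecall_1d", basecall_1d,
    ]


# Placeholder inputs; real values would come from inspecting each file.
runs = plan_runs({
    "a.fast5": "Basecall_1D_000",
    "b.fast5": "Basecall_1D_001",
    "c.fast5": None,
})
for group, files in runs.items():
    if group is None:
        print("re-basecall first:", files)
        continue
    # Assumed layout: the files of each group are copied or linked into their
    # own folder, because --wrkBase takes a folder rather than a file list.
    cmd = detect_command(
        "fast5_by_group/" + group, "reference.fa", "path/to/modfile",
        "test_" + group, "myoutput/", group,
    )
    print(" ".join(cmd))
```

Giving each run its own --FileID keeps the outputs of the two groups apart, in line with the recommendation printed in DeepMod's own startup message.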