PengNi / deepsignal2


Large percentage of fast5 reads failing call_mods #5

Closed: coeyct closed this issue 3 years ago

coeyct commented 3 years ago

Hi, I'm in the process of trying to push some test data through the full pipeline and wondered whether the output below is normal, or whether it points to a problem with how my data is set up or how the modules were installed. Below are the input commands and the output from running the "call_mods" command on deepsignal2 v0.1.0.

The single-read fast5s are split across a number of separate directories, but these all sit under the "singles" directory that I gave as the input path, so I doubt that's the problem. Plus, at least 12000 of the files seem to have been processed fine, so it must have picked up reads from multiple folders to do that. I suppose I could be using the wrong model file, but the one I used is the most recent release I could find.

I looked through a number of other closed issues and couldn't find anything that explicitly mentioned what would cause the "Mean of empty slice" or "invalid value encountered in double_scalars" warnings, so I don't know where to look for a potential fix. Any help you can provide would be appreciated.

[coeyct@cn2356 workspace]$ deepsignal2 call_mods --input_path singles/ --model_path singles/model/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt --result_file CpG.callmods.tsv --reference_path singles/GCF_000001405.39_GRCh38.p13_genomic.fna --corrected_group RawGenomeCorrected_001

===============================================

parameters:

input_path: singles/
model_path: singles/model/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt
model_type: both_bilstm
seq_len: 17
signal_len: 16
layernum1: 3
layernum2: 1
class_num: 2
dropout_rate: 0
n_vocab: 16
n_embed: 4
is_base: yes
is_signallen: yes
batch_size: 512
hid_rnn: 256
result_file: CpG.callmods.tsv
recursively: yes
corrected_group: RawGenomeCorrected_001
basecall_subgroup: BaseCalled_template
reference_path: singles/GCF_000001405.39_GRCh38.p13_genomic.fna
is_dna: yes
normalize_method: mad
methy_label: 1
motifs: CG
mod_loc: 0
f5_batch_size: 20
positions: None
nproc: 10
nproc_gpu: 2

===============================================

[main]call_mods starts..
40000 fast5 files in total..
parse the motifs string..
read genome reference file..
read position file if it is not None..
call_mods process-18980 starts
write_process-18981 starts
read_fast5 process-18972 starts
read_fast5 process-18976 starts
read_fast5 process-18977 starts
read_fast5 process-18973 starts
read_fast5 process-18974 starts
read_fast5 process-18978 starts
read_fast5 process-18975 starts
call_mods process-18979 starts
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3420: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3420: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3420: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3420: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
read_fast5 process-18978 ending, proceed 5740 fast5s
read_fast5 process-18976 ending, proceed 5740 fast5s
read_fast5 process-18973 ending, proceed 5700 fast5s
read_fast5 process-18972 ending, proceed 5700 fast5s
read_fast5 process-18974 ending, proceed 5780 fast5s
read_fast5 process-18977 ending, proceed 5660 fast5s
read_fast5 process-18975 ending, proceed 5680 fast5s
call_mods process-18980 ending, proceed 924 batches
call_mods process-18979 ending, proceed 977 batches
write_process-18981 finished
27321 of 40000 fast5 files failed..
[main]call_mods costs 791.46 seconds..

[coeyct@cn2356 workspace]$ call_modification_frequency.py \
    --input_path CpG.callmods.tsv \
    --result_file CpG.call.mods.frequency.tsv
get 1 input file(s)..
reading the input files..
100.00% (801480 of 801480) calls used..
writing the result..

[coeyct@cn2356 workspace]$ call_modification_frequency.py --input_path CpG.callmods.tsv --result_file CpG.call.mods.frequency.bed --bed
get 1 input file(s)..
reading the input files..
100.00% (801480 of 801480) calls used..
writing the result..

PengNi commented 3 years ago

Hi @coeyct , the commands you used seem fine to me. You can check the log of tombo resquiggle for your fast5 files. My guess is that this group of fast5 files is of relatively low quality and tombo resquiggle failed to process most of them.
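
For example, one quick way to check whether a single read actually carries the resquiggle results is to list the groups inside one fast5. This is just a minimal sketch: it assumes the h5dump utility from hdf5-tools is installed, and the read path is only a placeholder.

# A successfully resquiggled single-read fast5 should contain a group along the lines of
# /Analyses/RawGenomeCorrected_001/BaseCalled_template/Events
# (matching the --corrected_group / --basecall_subgroup values shown in the parameters above).
h5dump -n singles/some_subdir/example_read.fast5 | grep RawGenomeCorrected

If that group is missing for most reads, it is the resquiggle step that failed rather than call_mods itself.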

Best, Peng

coeyct commented 3 years ago

Thanks for the comment. I don't know if the read quality is necessarily the problem. The MinION Mk1C instrument allegedly generates reads that are typically above a Q-score of 7 -- usually more like a Q-score of 10 -- so I tried running the resquiggle with a Q-score cutoff of 8 added to the command. I wasn't able to get the resquiggle log before because the run was taking too long and the terminal quit updating, but here is what I got with a smaller subset of files (keep in mind these are not the exact same fast5s as above, but they are from the same data set and should be of relatively similar quality):

[coeyct@cn2368 20210404_02384_r1]$ tombo resquiggle retest/SRFast5_pass retest/GCF_000001405.39_GRCh38.p13_genomic.fna --processes 6 --corrected-group RawGenomeCorrected_001 --q-score 8 --overwrite --ignore-read-locks
[13:01:51] Loading minimap2 reference.
[13:03:21] Getting file list.
[13:03:21] Loading default canonical DNA model.
[13:03:23] Re-squiggling reads (raw signal to genomic sequence alignment).
100%|█████████████████████████████████████| 12000/12000 [26:52<00:00, 7.44it/s]
[13:30:16] Final unsuccessful reads summary (8.8% reads unsuccessfully processed; 1050 total reads):
    4.4% ( 526 reads) : Alignment not produced
    2.5% ( 304 reads) : Read event to sequence alignment extends beyond bandwidth
    1.6% ( 194 reads) : Poor raw to expected signal matching (revert with tombo filter clear_filters)
    0.2% (  20 reads) : Reference mapping contains non-canonical bases (transcriptome reference cannot contain U bases)
    0.1% (   6 reads) : Read filtered by q-score.

So you can see the Q-score flag did not change much. Also, an 8-9% rejection rate is well in line with what I have seen when aligning these data sets with Nanopore's cloud-based pipeline, so that much unsuccessful data was expected.

However, when I completed the final step, it rejected all 12000 reads from this particular attempt!

[coeyct@cn2368 20210404_02384_r1]$ deepsignal2 call_mods --input_path retest/SRFast5_pass/ --model_path retest/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt --result_file CpG20-22.call.mods.tsv --reference_path retest/GCF_000001405.39_GRCh38.p13_genomic.fna

===============================================

parameters:

input_path: retest/SRFast5_pass/
model_path: retest/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt
model_type: both_bilstm
seq_len: 17
signal_len: 16
layernum1: 3
layernum2: 1
class_num: 2
dropout_rate: 0
n_vocab: 16
n_embed: 4
is_base: yes
is_signallen: yes
batch_size: 512
hid_rnn: 256
result_file: CpG20-22.call.mods.tsv
recursively: yes
corrected_group: RawGenomeCorrected_000
basecall_subgroup: BaseCalled_template
reference_path: retest/GCF_000001405.39_GRCh38.p13_genomic.fna
is_dna: yes
normalize_method: mad
methy_label: 1
motifs: CG
mod_loc: 0
f5_batch_size: 20
positions: None
nproc: 10
nproc_gpu: 2

===============================================

[main]call_mods starts..
12000 fast5 files in total..
parse the motifs string..
read genome reference file..
read position file if it is not None..
write_process-51017 starts
call_mods process-51016 starts
read_fast5 process-51004 starts
call_mods process-51011 starts
read_fast5 process-51009 starts
read_fast5 process-51005 starts
read_fast5 process-51006 starts
read_fast5 process-51007 starts
read_fast5 process-51008 starts
read_fast5 process-51010 starts
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3420: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3420: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/opt/conda/envs/app/lib/python3.7/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
read_fast5 process-51007 ending, proceed 1780 fast5s
read_fast5 process-51010 ending, proceed 1620 fast5s
read_fast5 process-51008 ending, proceed 1600 fast5s
read_fast5 process-51009 ending, proceed 1780 fast5s
read_fast5 process-51006 ending, proceed 1800 fast5s
read_fast5 process-51005 ending, proceed 1660 fast5s
read_fast5 process-51004 ending, proceed 1760 fast5s
call_mods process-51011 ending, proceed 0 batches
call_mods process-51016 ending, proceed 0 batches
write_process-51017 finished
12000 of 12000 fast5 files failed..
[main]call_mods costs 73.41 seconds..

My hunch is that perhaps I am using the wrong model file for the reference genome and fast5s that I have. Does that make sense, and is there another model file I should try? Thanks again.

PengNi commented 3 years ago

Hi @coeyct , the problem may be that the value of --corrected_group in deepsignal2 call_mods is not the same as the one used in tombo resquiggle. Try setting --corrected_group RawGenomeCorrected_001 in deepsignal2 call_mods.
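
For example, reusing the paths and flags from your second run, the pair of commands should agree on the group name (a sketch of the pairing, not a new command); note that tombo spells the option with a hyphen while deepsignal2 spells it with an underscore:

# tombo writes its results under the group named by --corrected-group ...
tombo resquiggle retest/SRFast5_pass retest/GCF_000001405.39_GRCh38.p13_genomic.fna \
    --processes 6 --corrected-group RawGenomeCorrected_001 --q-score 8 --overwrite --ignore-read-locks

# ... and deepsignal2 must read from that same group via --corrected_group.
deepsignal2 call_mods --input_path retest/SRFast5_pass/ \
    --model_path retest/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt \
    --result_file CpG20-22.call.mods.tsv \
    --reference_path retest/GCF_000001405.39_GRCh38.p13_genomic.fna \
    --corrected_group RawGenomeCorrected_001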

Best, Peng

coeyct commented 3 years ago

OK, I just re-ran this data - I think I had typed "corrected-group" instead of "corrected_group" the last time, which is why that option had been dropped - and only got 856 rejected reads out of 12000. The prior time I attempted this, I had 10800 reads rejected, so I'm not really certain why it worked so much better this time than last. I will retry with the larger data set and see how it goes.

coeyct commented 3 years ago

OK, I think the issue was with the corrected-group option, as mentioned above: tombo spells the flag with a hyphen (--corrected-group), while the same option in deepsignal2 is spelled with an underscore (--corrected_group). It's possible I got tripped up by this inconsistency. Nothing else I've run so far has thrown back as many failed reads, so I'm pretty certain this was the cause. Thanks.