PengNi / deepsignal-plant

Detecting methylation using signal-level features from Nanopore sequencing reads of plants
GNU General Public License v3.0

call_mods hangs indefinitely #28

Closed: SwiftSeal closed this issue 1 year ago

SwiftSeal commented 1 year ago

Hello,

I'm running deepsignal-plant on a 750 Mb plant genome with approximately 40x coverage ONT reads. I ran:

CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s_single/ \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --corrected_group RawGenomeCorrected_000 \
  --motifs C --nproc 30 --nproc_gpu 6

a few days ago. It started successfully; so far the fast5s.C.call_mods.tsv file is 138 GB. However, it has now stalled and is no longer writing to the mods file.

There is a single process still running, consuming 100% CPU according to htop. It also periodically launches a child process, but this dies before I can see what it is.

This is the current output of deepsignal:

# ===============================================
## parameters:
input_path:
        fast5s_single/
f5_batch_size:
        30
model_path:
        model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt
model_type:
        both_bilstm
seq_len:
        13
signal_len:
        16
layernum1:
        3
layernum2:
        1
class_num:
        2
dropout_rate:
        0
n_vocab:
        16
n_embed:
        4
is_base:
        yes
is_signallen:
        yes
batch_size:
        512
hid_rnn:
        256
result_file:
        fast5s.C.call_mods.tsv
gzip:
        False
recursively:
        yes
corrected_group:
        RawGenomeCorrected_000
basecall_subgroup:
        BaseCalled_template
is_dna:
        yes
normalize_method:
        mad
motifs:
        C
mod_loc:
        0
region:
        None
positions:
        None
reference_path:
        None
nproc:
        30
nproc_gpu:
        6
# ===============================================
[main] call_mods starts..
cuda availability: True
7596528 fast5 files in total..
parse the motifs string..

I ran the example data successfully, so I'm not sure why this has happened! The only other possible cause I can see is that tombo resquiggle reported errors while running:

[09:44:21] Loading minimap2 reference.
[09:44:38] Getting file list.
[09:46:45] Loading default canonical ***** DNA ***** model.
[09:46:48] Re-squiggling reads (raw signal to genomic sequence alignment).
100%|██████████| 7596528/7596528 [80:18:14<00:00, 26.28it/s]
******************** WARNING ********************
        Unexpected errors occured. See full error stack traces for first (up to) 50 errors in "unexpected_tombo_errors.6920.err"
[18:05:02] Final unsuccessful reads summary (19.7% reads unsuccessfully processed; 1493848 total reads):
    14.0% (1061933 reads) : Alignment not produced
     4.0% ( 305247 reads) : Poor raw to expected signal matching (revert with `tombo filter clear_filters`)
     1.3% (  95189 reads) : Read event to sequence alignment extends beyond bandwidth
     0.4% (  31099 reads) : Reference mapping contains non-canonical bases (transcriptome reference cannot contain U bases)
     0.0% (    184 reads) : Fewer changepoints found than requested
     0.0% (     69 reads) : Read failed sequence-based signal re-scaling parameter estimation.
     0.0% (     64 reads) : Not enough raw signal around potential genomic deletion(s)
     0.0% (     51 reads) : Too much raw signal for mapped sequence
     0.0% (      9 reads) : Unexpected error
     0.0% (      3 reads) : Read contains too many potential genomic deletions
[18:05:02] Saving Tombo reads index to file.
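As an aside, the warning above says the reads dropped by the signal-matching filter can be restored; with tombo's CLI that should be something like this (using my fast5 directory):

tombo filter clear_filters --fast5-basedirs fast5s_single/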

Could the tombo errors affect the deepsignal run? Happy to share any other information needed.

Thanks in advance!

SwiftSeal commented 1 year ago

Taking a closer look at the processes, it turns out there are a lot more running, but they have all died. From htop:

    PID USER       PRI  NI  VIRT   RES   SHR S  CPU%▽MEM%   TIME+  Command
   2820 username      25   5 7739M 1026M  4292 R  99.5  0.5 83h09:48 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 /mnt/shared/scratch/username/apps/conda/env
   3746 username      25   5 13.0G 3749M  133M S   0.6  2.0 28h25:59 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   2991 username      25   5 27668  8136  1592 S   0.0  0.0  0:00.13 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.resource_tracker
   3722 username      25   5 8079M 1398M 10992 S   0.0  0.7 34h27:11 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3723 username      25   5 7567M  890M 10920 S   0.0  0.5 34h38:54 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3724 username      25   5 7578M  898M 11032 S   0.6  0.5 34h40:16 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3725 username      25   5 7440M  761M 10944 S   0.0  0.4 34h30:01 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3726 username      25   5 7329M  661M 10984 S   0.0  0.3 34h32:01 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3728 username      25   5 7735M 1059M 11052 S   0.0  0.6 34h30:41 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3729 username      25   5 8121M 1445M 11188 S   0.0  0.8 34h30:50 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3730 username      25   5 7631M  953M 11108 S   0.0  0.5 34h23:07 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3731 username      25   5 7664M  992M 11108 S   0.0  0.5 34h28:03 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3732 username      25   5 7541M  866M 11000 S   0.0  0.5 34h16:46 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3733 username      25   5 7601M  930M 11136 S   0.0  0.5 34h32:43 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3734 username      25   5 7385M  712M 11148 S   0.0  0.4 34h19:38 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3735 username      25   5 8585M 1892M 10932 S   0.0  1.0 34h36:07 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3736 username      25   5 7693M 1024M 11068 S   0.0  0.5 34h29:37 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3737 username      25   5 8487M 1815M 11100 S   0.0  0.9 34h48:06 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3738 username      25   5 7971M 1295M 11152 S   0.0  0.7 34h24:11 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3739 username      25   5 7377M  697M 11048 S   0.0  0.4 34h30:45 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3740 username      25   5 8202M 1524M 11116 S   0.0  0.8 34h30:49 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3741 username      25   5 7377M  706M 11144 S   0.0  0.4 34h17:47 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3742 username      25   5     0     0     0 Z   0.0  0.0 21h59:12 python3.9
   3743 username      25   5 7696M 1014M 11004 S   0.0  0.5 34h31:07 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3744 username      25   5 8308M 1641M 11164 S   0.0  0.9 34h27:05 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3745 username      25   5 8057M 1378M 11148 S   0.0  0.7 34h18:14 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3746 username      25   5 13.0G 3749M  133M S   0.0  2.0 28h25:59 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3747 username      25   5     0     0     0 Z   0.0  0.0 27h33:19 python3.9
   3748 username      25   5     0     0     0 Z   0.0  0.0 27h37:08 python3.9
   3749 username      25   5 13.2G 3856M  133M S   0.0  2.0 28h14:12 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw
   3750 username      25   5     0     0     0 Z   0.0  0.0 16h00:46 python3.9
   3751 username      25   5     0     0     0 Z   0.0  0.0  6h36:29 python3.9
   3753 username      25   5 6771M  176M  2776 S   0.0  0.1 16:43.84 /mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spaw

The full command line of the processes that keep relaunching:

/mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=68) --multiprocessing-fork
/mnt/shared/scratch/username/apps/conda/envs/deepsignalpenv/bin/python3.9 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=74) --multiprocessing-fork
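For reference, this is roughly how I listed the zombie workers (a generic ps/awk one-liner, nothing deepsignal-specific):

# print PID, parent PID, state, elapsed time and command for zombie (Z) processes
ps -eo pid,ppid,stat,etime,cmd | awk '$3 ~ /^Z/'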
SwiftSeal commented 1 year ago

I have a feeling this might be due to memory caps on our SLURM system. Looking at issue https://github.com/PengNi/deepsignal-plant/issues/23, I was running 32 nprocs with only 60 GB of memory, so it's likely I exceeded that?

I've capped it to 16 procs and relaunched to see whether it freezes at the same point; if it doesn't, I'll close this :)
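Roughly what the relaunched job looks like, for reference (the SLURM directives and the memory figure are illustrative for our cluster, not tuned recommendations):

#!/bin/bash
#SBATCH --job-name=call_mods
#SBATCH --cpus-per-task=16
#SBATCH --mem=120G
#SBATCH --gres=gpu:1

CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s_single/ \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --corrected_group RawGenomeCorrected_000 \
  --motifs C --nproc 16 --nproc_gpu 6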

PengNi commented 1 year ago

@SwiftSeal , thank you very much for using deepsignal-plant! You can also try setting a smaller --f5_batch_size and a smaller --batch_size to reduce the memory usage of each process.
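For example, something like this (the values here are just a starting point to try, smaller than the f5_batch_size 30 / batch_size 512 shown in your parameter dump above):

deepsignal_plant call_mods --input_path fast5s_single/ \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --corrected_group RawGenomeCorrected_000 \
  --motifs C --nproc 16 --nproc_gpu 6 \
  --f5_batch_size 10 --batch_size 256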

Best, Peng

SwiftSeal commented 1 year ago

It was a memory issue; it ran fine once I gave it enough room :) I'll close this now. Great piece of software!