bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

Problem while running deepsignal on GPU #38

Closed: stefanucci-luca closed this issue 4 years ago

stefanucci-luca commented 4 years ago

Hi PengNi,

I am trying to run deepsignal on our HPC GPU, but I get this error:

# ===============================================
## parameters: 
input_path:
    /home/ls760/nanopore/us/scripts/test_area/ls760/methylation_pipeline/VWD1047/tmp
model_path:
    /rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/methylation_models/deepsignal_human/model.CpG.R9.4_1D.human_hx1.bn17.sn360/bn_17.sn_360.epoch_7.ckpt
is_cnn:
    yes
is_rnn:
    yes
is_base:
    yes
kmer_len:
    17
cent_signals_len:
    360
batch_size:
    512
learning_rate:
    0.001
class_num:
    2
result_file:
    Nanopore_methylationanalysis.tsv_call_mods.tsv
recursively:
    yes
corrected_group:
    RawGenomeCorrected_000
basecall_subgroup:
    BaseCalled_template
reference_path:
    /rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/scripts/test_area/ls760/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
is_dna:
    yes
normalize_method:
    mad
methy_label:
    1
motifs:
    CG
mod_loc:
    0
f5_batch_num:
    100
positions:
    None
nproc:
    10
is_gpu:
    yes
# ===============================================
898913 fast5 files in total..
parse the motifs string..
read genome reference file..
read position file if it is not None..
/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:521: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:522: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
write_process started..
2020-03-03 14:44:05.613202: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-03 14:44:24.483271: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key modelem/bw/multi_rnn_cell/cell_0/lstm_cell/bias not found in checkpoint
Process Process-9:
Traceback (most recent call last):
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key modelem/bw/multi_rnn_cell/cell_0/lstm_cell/bias not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/site-packages/deepsignal-0.1.7-py3.6.egg/deepsignal/call_modifications.py", line 171, in _call_mods_q
    saver.restore(sess, model_path)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1802, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key modelem/bw/multi_rnn_cell/cell_0/lstm_cell/bias not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
  File "/rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/envs/ont/deepsignalenv_gpu/bin/deepsignal", line 11, in <module>
    load_entry_point('deepsignal==0.1.7', 'console_scripts', 'deepsignal')()
  File "/rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/site-packages/deepsignal-0.1.7-py3.6.egg/deepsignal/deepsignal.py", line 423, in main
    args.func(args)
  File "/rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/site-packages/deepsignal-0.1.7-py3.6.egg/deepsignal/deepsignal.py", line 87, in main_call_mods
    f5_args)
  File "/rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/site-packages/deepsignal-0.1.7-py3.6.egg/deepsignal/call_modifications.py", line 393, in call_mods
    is_rnn, is_base, is_cnn)
  File "/rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/site-packages/deepsignal-0.1.7-py3.6.egg/deepsignal/call_modifications.py", line 339, in _call_mods_from_fast5s_gpu
    p_call_mods_gpu.start()
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/popen_fork.py", line 73, in _launch
    code = process_obj._bootstrap()
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ls760/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/rds/project/who1000/rds-who1000-wgs10k/WGS10K/data/projects/nanopore/us/resources/envs/ont/deepsignalenv_gpu/lib/python3.6/site-packages/deepsignal-0.1.7-py3.6.egg/deepsignal/call_modifications.py", line 170, in _call_mods_q
    saver = tf.train.Saver()
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1338, in __init__
    self.build()
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1347, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1384, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 835, in _build_internal
    restore_sequentially, reshape)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 472, in _AddRestoreOps
    restore_sequentially)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 886, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/ls760/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key modelem/bw/multi_rnn_cell/cell_0/lstm_cell/bias not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

From a bit of searching online, this looks like a problem with the model. Would you agree? Do you have any idea how to solve it?

Thanks, Luca

PengNi commented 4 years ago

Hi @stefanucci-luca,

Thanks for your interest. It looks like you are using deepsignal v0.1.7, so the model model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz (Google Drive) should be used instead of the one in your model_path, whose directory name does not carry the v0.1.7+ suffix.

Best, Peng
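
For anyone hitting the same NotFoundError: one way to confirm a model/version mismatch is to compare the variable names the graph asks for with the names stored in the checkpoint. This is only a rough sketch, not part of deepsignal. The helper names `missing_checkpoint_keys` and `list_checkpoint_vars` are made up for illustration, and the second one assumes a TensorFlow install that provides `tf.train.list_variables`.

```python
def missing_checkpoint_keys(graph_var_names, checkpoint_vars):
    """Return the graph variable names that the checkpoint does not contain.

    graph_var_names: iterable of variable names the restore op will ask for.
    checkpoint_vars: iterable of (name, shape) pairs, the format returned by
                     tf.train.list_variables().
    """
    available = {name for name, _shape in checkpoint_vars}
    return sorted(name for name in graph_var_names if name not in available)


def list_checkpoint_vars(ckpt_prefix):
    """List (name, shape) pairs stored in a checkpoint (needs TensorFlow)."""
    # Hypothetical helper: imported lazily so the function above works without TF.
    import tensorflow as tf
    return tf.train.list_variables(ckpt_prefix)


# Illustrative data only: the wanted name is taken from the error message,
# the checkpoint contents below are invented for the example.
wanted = ["modelem/bw/multi_rnn_cell/cell_0/lstm_cell/bias"]
in_checkpoint = [("some/other/variable", [128, 2])]
print(missing_checkpoint_keys(wanted, in_checkpoint))
```

With a real checkpoint you would pass the output of `list_checkpoint_vars(...)` on the `.ckpt` prefix as the second argument. If the LSTM bias key from the error shows up as missing, the checkpoint was most likely saved by a different deepsignal version than the one building the graph.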