[BUG] RuntimeError: job:25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba 116625 failed 3 times

biglinn commented 6 months ago

Bug summary

When I run dpgen run, an error occurs in the 'train' stage and the job is always terminated. I checked train.log on the remote machine where the 'train' stage runs. There seems to be a TensorFlow error:

WARNING:tensorflow:From /home/biglinn/miniconda3/envs/dpmd227/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version

tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected begin[1] in [0, 40], but got 655561818 [[{{node Slice_1}}]]

DeePMD-kit Version

2.2.7

TensorFlow Version

2.15.0 (installed as a dependency of deepmd-kit)

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

machine.json, param.json, train.log

I run the job with the command:

nohup dpgen run param.json machine.json > run.out &

train.log is the log file from training one of the models.

Steps to Reproduce

After running nohup dpgen run param.json machine.json > run.out &, I found that the model training job was terminated and then resubmitted. After this happened 3 times, dpgen run itself was terminated. I checked train.log in the work directory. The problem appears to be related to TensorFlow, but I'm not sure.
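When several train.log files have to be checked, the relevant line can be pulled out automatically. The following is a minimal sketch, assuming the error text has the same shape as the message quoted in the bug summary; the function name `find_slice_errors` and the regular expression are illustrative and not part of dpgen or DeePMD-kit.

```python
import re

# Pattern modelled on the message quoted in this issue (an assumption:
# other TensorFlow versions may word the message differently).
SLICE_ERROR = re.compile(
    r"Expected begin\[(?P<axis>\d+)\] in \[0, (?P<dim>\d+)\], "
    r"but got (?P<value>-?\d+)"
)


def find_slice_errors(log_text):
    """Return (axis, dim, value) tuples for every matching error message."""
    return [
        (int(m["axis"]), int(m["dim"]), int(m["value"]))
        for m in SLICE_ERROR.finditer(log_text)
    ]


# Sample line copied from the error reported above.
sample = (
    "tensorflow.python.framework.errors_impl.InvalidArgumentError: "
    "Expected begin[1] in [0, 40], but got 655561818 [[{{node Slice_1}}]]"
)
print(find_slice_errors(sample))  # [(1, 40, 655561818)]
```

To scan a real log, pass the file contents instead of `sample`, for example the text read from train.log in the work directory.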

Further Information, Files, and Links

No response

robinzyb commented 6 months ago

I found in your log that the training had already finished. Could you try restarting dpgen?

njzjz commented 6 months ago

It's probably a GPU memory problem, such as an out-of-bounds array access, which does not necessarily trigger an error on every run.
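For context, the wording of the error suggests that each `begin` index of the slice is validated against the matching input dimension before slicing. Below is a minimal pure-Python sketch of that kind of check. It is not TensorFlow's actual implementation; `check_slice_begin` is a hypothetical name, and the leading dimension of 1 is made up for illustration.

```python
def check_slice_begin(begin, shape):
    """Raise ValueError if any begin index falls outside [0, dim]."""
    for axis, (b, dim) in enumerate(zip(begin, shape)):
        if not 0 <= b <= dim:
            raise ValueError(
                f"Expected begin[{axis}] in [0, {dim}], but got {b}"
            )


# The second begin index and dimension are taken from the reported error;
# the first axis (begin 0, size 1) is an assumption for the example.
try:
    check_slice_begin(begin=[0, 655561818], shape=[1, 40])
except ValueError as err:
    message = str(err)
    print(message)  # Expected begin[1] in [0, 40], but got 655561818
```

A check like this only reports that the index is invalid. It says nothing about why the index held that value, which is why a memory problem upstream of the slice remains a plausible explanation.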

biglinn commented 6 months ago

So, will it be complex to debug this? Well, I'd better use a stable version for now.

njzjz commented 6 months ago

So, will it be complex to debug this?

Are you able to reproduce the error when running it a second time? If not, it will be hard to figure out the reason.
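One way to answer this is to rerun the same command several times and record which attempts fail. The sketch below is only an illustration: `rerun` is a hypothetical helper, and the stand-in command is there so the snippet runs anywhere. For a real check it would be replaced with the training command, for example `["dp", "train", "input.json"]`, where the file name `input.json` is an assumption.

```python
import subprocess
import sys


def rerun(command, attempts=3):
    """Run `command` `attempts` times and return the list of exit codes."""
    codes = []
    for _ in range(attempts):
        # Capture output so repeated runs do not flood the terminal.
        result = subprocess.run(command, capture_output=True, text=True)
        codes.append(result.returncode)
    return codes


# Stand-in command that always succeeds; swap in the real training command.
codes = rerun([sys.executable, "-c", "raise SystemExit(0)"], attempts=2)
print(codes)  # [0, 0]
```

A mix of zero and non-zero exit codes for identical inputs would point to a sporadic failure rather than a deterministic bug.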

biglinn commented 6 months ago

I think it is reproducible. When I hit this error, I retried several times and got the same error each time. I then switched to another stable setup (deepmd-kit 2.2.7 and tensorflow 2.9.0), which I installed with the command from the deepmodeling website:

(base)$ conda create -n deepmd deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps cudatoolkit=11.3 horovod -c https://conda.deepmodeling.org

I have been using this version since. You say it's probably a GPU problem, but the error has not appeared again.

I will try deepmd-kit 2.2.8 and tensorflow 2.15.0 again to see whether the error can be reproduced, and will give details later.
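Since the comparison here hinges on which versions are active in each environment, it can help to print them from inside the environment rather than rely on memory. This is a minimal sketch using only the standard library; the distribution names `deepmd-kit` and `tensorflow` are assumptions, and conda builds may register their metadata under different names, in which case `None` is reported.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed, or installed under a different distribution name.
            versions[name] = None
    return versions


print(installed_versions(["deepmd-kit", "tensorflow"]))
```

Running this in both the failing and the working environment gives a record that can be pasted into the issue.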

biglinn commented 6 months ago

It's strange! I went back to the previous version to reproduce this error, but it did not appear. It seems the earlier failure was an unexpected, one-off error.

njzjz commented 5 months ago

As shown below, I just met the same problem when training the water example.

DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    batch    2100 training time 3.26 s, testing time 0.05 s, total wall time 3.42 s
DEEPMD INFO    batch    2200 training time 3.43 s, testing time 0.05 s, total wall time 3.52 s
DEEPMD INFO    batch    2300 training time 3.45 s, testing time 0.05 s, total wall time 3.54 s
DEEPMD INFO    batch    2400 training time 3.27 s, testing time 0.06 s, total wall time 3.36 s
Traceback (most recent call last):
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1402, in _do_call
    return fn(*args)
           ^^^^^^^^^
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1385, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1478, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected begin[1] in [0, 192], but got -1610612736
         [[{{node Slice_3}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jz748/anaconda3/envs/conda-forge/bin/dp", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/jz748/codes/deepmd-kit/deepmd/main.py", line 805, in main
    deepmd_main(args)
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/main.py", line 72, in main
    train_dp(**dict_args)
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/train.py", line 153, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/train.py", line 270, in _do_work
    model.train(train_data, valid_data)
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/train/trainer.py", line 723, in train
    _, next_train_batch_list = run_sess(
                               ^^^^^^^^^
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/utils/sess.py", line 31, in run_sess
    return sess.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 972, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1215, in _run
    results = self._do_run(handle, final_targets, final_fetches,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1395, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1421, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'Slice_3' defined at (most recent call last):
    File "/home/jz748/anaconda3/envs/conda-forge/bin/dp", line 8, in <module>
    File "/home/jz748/codes/deepmd-kit/deepmd/main.py", line 805, in main
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/main.py", line 72, in main
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/train.py", line 153, in train
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/train.py", line 265, in _do_work
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/train/trainer.py", line 310, in build
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/train/trainer.py", line 387, in _build_network
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/model/ener.py", line 240, in build
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/fit/ener.py", line 638, in build
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/common.py", line 258, in wrapper
    File "/home/jz748/codes/deepmd-kit/deepmd/tf/fit/ener.py", line 372, in _build_lower
Node: 'Slice_3'
Expected begin[1] in [0, 192], but got -1610612736
         [[{{node Slice_3}}]]

Original stack trace for 'Slice_3':
  File "/home/jz748/anaconda3/envs/conda-forge/bin/dp", line 8, in <module>
  File "/home/jz748/codes/deepmd-kit/deepmd/main.py", line 805, in main
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/main.py", line 72, in main
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/train.py", line 153, in train
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/entrypoints/train.py", line 265, in _do_work
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/train/trainer.py", line 310, in build
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/train/trainer.py", line 387, in _build_network
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/model/ener.py", line 240, in build
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/fit/ener.py", line 638, in build
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/common.py", line 258, in wrapper
  File "/home/jz748/codes/deepmd-kit/deepmd/tf/fit/ener.py", line 372, in _build_lower
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/ops/array_ops.py", line 1223, in slice
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/ops/gen_array_ops.py", line 9825, in _slice
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/framework/ops.py", line 2652, in _create_op_internal
  File "/home/jz748/anaconda3/envs/conda-forge/lib/python3.11/site-packages/tensorflow/python/framework/ops.py", line 1160, in from_node_def

However, it cannot be reproduced in a second run.
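Both out-of-range values reported in this thread (655561818 and -1610612736) can be inspected as raw 32-bit patterns, which sometimes helps judge whether an index looks like leftover memory rather than a miscalculated offset. This is a small sketch under the assumption that the index is a 32-bit integer; the helper name `as_uint32_hex` is hypothetical.

```python
def as_uint32_hex(value):
    """Format a signed integer as its 32-bit two's-complement bit pattern."""
    return f"0x{value & 0xFFFFFFFF:08X}"


for reported in (655561818, -1610612736):
    print(reported, as_uint32_hex(reported))
# 655561818 0x2713145A
# -1610612736 0xA0000000
```

Neither pattern is related to the valid ranges [0, 40] and [0, 192]. That is consistent with, but does not prove, the suspicion that the index was read from corrupted or uninitialized memory.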

TensorFlow is from conda-forge, as shown below.

tensorflow                2.15.0          cuda120py311h5cbd639_3    conda-forge
tensorflow-base           2.15.0          cuda120py311h43b5e44_3    conda-forge
tensorflow-estimator      2.15.0          cuda120py311hf663016_3    conda-forge

deepmodeling / deepmd-kit