deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

one error to compress a model #795

Closed baoqinfu closed 3 years ago

baoqinfu commented 3 years ago

Summary: I ran "dp compress -i SiClda1.pb input.json -o SiCldaC1.pb" to compress a model, and it failed with the error below. How can I address this problem?

-----------------------------------------------------------------------------------------------------------------------------------------
wanrun dp_lda1_test $ dp compress -i SiClda1.pb input.json -o SiCldaC1.pb
WARNING:tensorflow:From /home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-papython/compat/v2_compat.py:61: disable_resource_variables (from tensorflow.python.ops.varprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO    |-> deepmd.entrypoints.compress                   

DEEPMD INFO    |-> deepmd.entrypoints.compress                   stage 1: train or refinebulation
DEEPMD INFO    |-> deepmd.entrypoints.train                       _____               ___           _     _  _   
DEEPMD INFO    |-> deepmd.entrypoints.train                      |  __ \             |  _\         | |   (_)| |  
DEEPMD INFO    |-> deepmd.entrypoints.train                      | |  | |  ___   ___ | |_ | ______ | | __ _ | |_ 
DEEPMD INFO    |-> deepmd.entrypoints.train                      | |  | | / _ \ / _ \|  _ ||______|| |/ /| || __|
DEEPMD INFO    |-> deepmd.entrypoints.train                      | |__| ||  __/|  __/| |  |        |   < | || |_ 
DEEPMD INFO    |-> deepmd.entrypoints.train                      |_____/  \___| \___||_| /         |_|\_\|_| \__|
DEEPMD INFO    |-> deepmd.entrypoints.train                      Please read and cite:
DEEPMD INFO    |-> deepmd.entrypoints.train                      Wang, Zhang, Han and E, 228, 178-184 (2018)
DEEPMD INFO    |-> deepmd.entrypoints.train                      installed to:         /t0mzgx2n/_skbuild/linux-x86_64-3.7/cmake-install
DEEPMD INFO    |-> deepmd.entrypoints.train                      source :              v1dirty
DEEPMD INFO    |-> deepmd.entrypoints.train                      source brach:         ap
DEEPMD INFO    |-> deepmd.entrypoints.train                      source commit:        5d
DEEPMD INFO    |-> deepmd.entrypoints.train                      source commit at:     20+0800
DEEPMD INFO    |-> deepmd.entrypoints.train                      build float prec:     fl
DEEPMD INFO    |-> deepmd.entrypoints.train                      build with tf inc:    /h/dp-api/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include;/home/wanrun/denghow_venv/lib/python3.7/site-packages/tensorflow/include
DEEPMD INFO    |-> deepmd.entrypoints.train                      build with tf lib:    
DEEPMD INFO    |-> deepmd.run_options                            ---Summary of the traini-----------------------
DEEPMD INFO    |-> deepmd.run_options                            running on:           lo
DEEPMD INFO    |-> deepmd.run_options                            CUDA_VISIBLE_DEVICES: un
DEEPMD INFO    |-> deepmd.run_options                            num_intra_threads:    0
DEEPMD INFO    |-> deepmd.run_options                            num_inter_threads:    0
DEEPMD INFO    |-> deepmd.run_options                            -----------------------------------------------
2021-06-24 14:25:13.080366: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed KNOWN ERROR (303)
/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/utils/dataerWarning: system ../data.init/C_atomSpin2/dpmd required batch size is larger than the si../data.init/C_atomSpin2/dpmd/set.000 (32 > 1)
  (self.system_dirs[ii], chk_ret[0], self.batch_size[ii], chk_ret[1]))
/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/utils/dataerWarning: system ../data.init/C_atomSpin2/dpmd required test size is larger than the siz./data.init/C_atomSpin2/dpmd/set.000 (2 > 1)
  (self.system_dirs[ii], chk_ret[0], self.test_size[ii], chk_ret[1]))
/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/utils/dataerWarning: system ../data.init/Si_atomSpin2/dpmd required batch size is larger than the s ../data.init/Si_atomSpin2/dpmd/set.000 (32 > 1)
  (self.system_dirs[ii], chk_ret[0], self.batch_size[ii], chk_ret[1]))
/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/utils/dataerWarning: system ../data.init/Si_atomSpin2/dpmd required test size is larger than the si../data.init/Si_atomSpin2/dpmd/set.000 (2 > 1)
  (self.system_dirs[ii], chk_ret[0], self.test_size[ii], chk_ret[1]))
DEEPMD INFO    |-> deepmd.utils.data_system                      ---Summary of DataSystem------------------------------
DEEPMD INFO    |-> deepmd.utils.data_system                      found 36 system(s):
DEEPMD INFO    |-> deepmd.utils.data_system                                                
DEEPMD INFO    |-> deepmd.utils.data_system                      natoms  bch_sz  n_bch   
DEEPMD INFO    |-> deepmd.utils.data_system                                   ../data.ini       1      32       1       2  0.000
DEEPMD INFO    |-> deepmd.utils.data_system                                  ../data.init       1      32       1       2  0.000
DEEPMD INFO    |-> deepmd.utils.data_system                      -- H-4.02x02x02/02.md/sy      32       1     600       2  0.208
DEEPMD INFO    |-> deepmd.utils.data_system                      -- C-2.03x03x03/02.md/sy      54       1     600       2  0.208
DEEPMD INFO    |-> deepmd.utils.data_system                      -- H-8.02x02x01/02.md/sy      32       1     600       2  0.208
DEEPMD INFO    |-> deepmd.utils.data_system                      -- -12.02x02x01/02.md/sy      48       1     600       2  0.208
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1       7       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      54       1      21       2  0.007
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      17       2  0.006
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      48       1      19       2  0.007
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      54       1       8       2  0.003
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1       5       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      48       1       8       2  0.003
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      15       2  0.005
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      54       1      12       2  0.004
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      10       2  0.003
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      48       1      12       2  0.004
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      16       2  0.006
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1       6       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1       5       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      22       2  0.008
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      54       1      34       2  0.012
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      18       2  0.006
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      48       1      19       2  0.007
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1       6       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      54       1       6       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1       6       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      48       1       5       2  0.002
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      27       2  0.009
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      54       1      19       2  0.007
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      15       2  0.005
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      48       1      20       2  0.007
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      39       2  0.014
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      54       1       9       2  0.003
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      32       1      30       2  0.010
DEEPMD INFO    |-> deepmd.utils.data_system                        ../data.iters/iter.000      48       1      40       2  0.014
DEEPMD INFO    |-> deepmd.utils.data_system                      ------------------------------------------------------

DEEPMD INFO    |-> deepmd.trainer                                training without frame p
2021-06-24 14:25:15.792888: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] : Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was nnt XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To s active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via et the envvar XLA_FLAGS=--xla_hlo_profile.
Traceback (most recent call last):
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/main main
    compress(**dict_args)
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/en.py", line 102, in compress
    log_path=log_path,
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/en", line 211, in train
    _do_work(jdata, run_opt)
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/en", line 291, in _do_work
    model.build(data, stop_batch)
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/tr4, in build
    = self.neighbor_stat.get_stat(data)
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/deepmd/utpy", line 85, in get_stat
    dt = np.min(dt)
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/numpy/cor line 2618, in amin
    initial=initial)
  File "/home/wanrun/denghui/dp-api/tensorflow_venv/lib/python3.7/site-packages/numpy/cor line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity
-----------------------------------------------------------------------------------------------------------------------------------------
Version info of the deepmd-kit used:
DEEPMD INFO    |-> deepmd.entrypoints.train                      source :              v1.2.2-382-g5d21c7f-dirty
DEEPMD INFO    |-> deepmd.entrypoints.train                      source brach:         api-summit
DEEPMD INFO    |-> deepmd.entrypoints.train                      source commit:        5d21c7f
DEEPMD INFO    |-> deepmd.entrypoints.train                      source commit at:     2021-04-04 17:57:36 +0800


denghuilu commented 3 years ago

Which version of deepmd-kit was used for the model training?

baoqinfu commented 3 years ago

> Which version of deepmd-kit was used for the model training?

Sorry for the late reply. The version info of the deepmd-kit used for the model training is:

DEEPMD: ---Summary of the training---------------------------------------
DEEPMD: installed to: /tmp/pip-req-build-gek426j1/_skbuild/linux-x86_64-3.8/cmake-install
DEEPMD: source : v1.2.2-85-gb96112e-dirty
DEEPMD: source brach: devel
DEEPMD: source commit: b96112e
DEEPMD: source commit at: 2020-11-26 14:12:51 +0800


baoqinfu commented 3 years ago

I used "dp convert-from -I old.pb -o new.pb '1.2'" to convert the old model into a new one (supported by 2.0), then used "/opt/deepmd-kit-2.0.0.b3/bin/dp compress -i SiClda1_c3.pb -o SiClda1_com.pb input2.json" to compress the model, but I got a similar error:

2021-07-13 17:39:10.288795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
DEEPMD INFO

DEEPMD INFO stage 1: train or refine the model with tabulation
DEEPMD INFO Please read and cite:
DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO installed to: /tmp/pip-req-build-qqv2ggzp/_skbuild/linux-x86_64-3.9/cmake-install
DEEPMD INFO source : v2.0.0.b3
DEEPMD INFO source brach: HEAD
DEEPMD INFO source commit: de428e3
DEEPMD INFO source commit at: 2021-07-04 22:12:13 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build with tf inc: /opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/tensorflow/include;/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/tensorflow/include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: login0
DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
2021-07-13 17:39:18.657117: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-13 17:39:18.658367: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:
2021-07-13 17:39:18.658392: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-13 17:39:18.658412: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (login0): /proc/driver/nvidia/version does not exist
2021-07-13 17:39:18.658425: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/utils/data_system.py:156: UserWarning: system ../data/data.init/C_atomSpin2/dpmd required batch size is larger than the size of the dataset ../data/data.init/C_atomSpin2/dpmd/set.000 (32 > 1)
  warnings.warn("system %s required batch size is larger than the size of the dataset %s (%d > %d)" % \
/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/utils/data_system.py:156: UserWarning: system ../data/data.init/Si_atomSpin2/dpmd required batch size is larger than the size of the dataset ../data/data.init/Si_atomSpin2/dpmd/set.000 (32 > 1)
  warnings.warn("system %s required batch size is larger than the size of the dataset %s (%d > %d)" % \
DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD INFO found 36 system(s):
DEEPMD INFO system natoms bch_sz n_bch prob pbc
DEEPMD INFO -- H-4.02x02x02/02.md/sys-0016-0016/deepmd 32 1 600 0.208 T
DEEPMD INFO -- C-2.03x03x03/02.md/sys-0027-0027/deepmd 54 1 600 0.208 T
DEEPMD INFO -- H-8.02x02x01/02.md/sys-0016-0016/deepmd 32 1 600 0.208 T
DEEPMD INFO -- -12.02x02x01/02.md/sys-0024-0024/deepmd 48 1 600 0.208 T
DEEPMD INFO ../data/data.init/C_atomSpin2/dpmd 1 32 1 0.000 T
DEEPMD INFO ../data/data.init/Si_atomSpin2/dpmd 1 32 1 0.000 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.000 32 1 7 0.002 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.010 54 1 21 0.007 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.020 32 1 17 0.006 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.030 48 1 19 0.007 T
DEEPMD INFO -- a/data.iters/iter.000002/02.fp/data.012 54 1 8 0.003 T
DEEPMD INFO -- a/data.iters/iter.000002/02.fp/data.022 32 1 5 0.002 T
DEEPMD INFO -- a/data.iters/iter.000002/02.fp/data.032 48 1 8 0.003 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.003 32 1 15 0.005 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.013 54 1 12 0.004 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.023 32 1 10 0.003 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.033 48 1 12 0.004 T
DEEPMD INFO -- a/data.iters/iter.000004/02.fp/data.004 32 1 16 0.006 T
DEEPMD INFO -- a/data.iters/iter.000004/02.fp/data.024 32 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000005/02.fp/data.005 32 1 5 0.002 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.006 32 1 22 0.008 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.016 54 1 34 0.012 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.026 32 1 18 0.006 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.036 48 1 19 0.007 T
DEEPMD INFO -- a/data.iters/iter.000007/02.fp/data.007 32 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000007/02.fp/data.017 54 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000008/02.fp/data.008 32 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000008/02.fp/data.038 48 1 5 0.002 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.009 32 1 27 0.009 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.019 54 1 19 0.007 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.029 32 1 15 0.005 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.039 48 1 20 0.007 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.009 32 1 39 0.014 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.019 54 1 9 0.003 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.029 32 1 30 0.010 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.039 48 1 40 0.014 T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD INFO found 36 system(s):
DEEPMD INFO system natoms bch_sz n_bch prob pbc
DEEPMD INFO -- H-4.02x02x02/02.md/sys-0016-0016/deepmd 32 1 600 0.208 T
DEEPMD INFO -- C-2.03x03x03/02.md/sys-0027-0027/deepmd 54 1 600 0.208 T
DEEPMD INFO -- H-8.02x02x01/02.md/sys-0016-0016/deepmd 32 1 600 0.208 T
DEEPMD INFO -- -12.02x02x01/02.md/sys-0024-0024/deepmd 48 1 600 0.208 T
DEEPMD INFO ../data/data.init/C_atomSpin2/dpmd 1 1 1 0.000 T
DEEPMD INFO ../data/data.init/Si_atomSpin2/dpmd 1 1 1 0.000 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.000 32 1 7 0.002 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.010 54 1 21 0.007 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.020 32 1 17 0.006 T
DEEPMD INFO -- a/data.iters/iter.000000/02.fp/data.030 48 1 19 0.007 T
DEEPMD INFO -- a/data.iters/iter.000002/02.fp/data.012 54 1 8 0.003 T
DEEPMD INFO -- a/data.iters/iter.000002/02.fp/data.022 32 1 5 0.002 T
DEEPMD INFO -- a/data.iters/iter.000002/02.fp/data.032 48 1 8 0.003 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.003 32 1 15 0.005 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.013 54 1 12 0.004 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.023 32 1 10 0.003 T
DEEPMD INFO -- a/data.iters/iter.000003/02.fp/data.033 48 1 12 0.004 T
DEEPMD INFO -- a/data.iters/iter.000004/02.fp/data.004 32 1 16 0.006 T
DEEPMD INFO -- a/data.iters/iter.000004/02.fp/data.024 32 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000005/02.fp/data.005 32 1 5 0.002 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.006 32 1 22 0.008 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.016 54 1 34 0.012 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.026 32 1 18 0.006 T
DEEPMD INFO -- a/data.iters/iter.000006/02.fp/data.036 48 1 19 0.007 T
DEEPMD INFO -- a/data.iters/iter.000007/02.fp/data.007 32 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000007/02.fp/data.017 54 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000008/02.fp/data.008 32 1 6 0.002 T
DEEPMD INFO -- a/data.iters/iter.000008/02.fp/data.038 48 1 5 0.002 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.009 32 1 27 0.009 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.019 54 1 19 0.007 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.029 32 1 15 0.005 T
DEEPMD INFO -- a/data.iters/iter.000009/02.fp/data.039 48 1 20 0.007 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.009 32 1 39 0.014 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.019 54 1 9 0.003 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.029 32 1 30 0.010 T
DEEPMD INFO -- a/data.iters/iter.000010/02.fp/data.039 48 1 40 0.014 T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO training without frame parameter
2021-07-13 17:39:21.475630: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-07-13 17:39:21.476746: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500000000 Hz
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0-3
OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #216: KMP_AFFINITY: cpuid leaf 11 not supported.
OMP: Info #216: KMP_AFFINITY: decoding legacy APIC ids.
OMP: Info #157: KMP_AFFINITY: 4 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #192: KMP_AFFINITY: 1 socket x 2 cores/socket x 2 threads/core (2 total cores)
OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 1 maps to socket 0 core 0 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 2 maps to socket 0 core 1 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 3 maps to socket 0 core 1 thread 1
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 5763 thread 0 bound to OS proc set 0
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 6478 thread 1 bound to OS proc set 2
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 6480 thread 3 bound to OS proc set 3
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 6479 thread 2 bound to OS proc set 1
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 5764 thread 4 bound to OS proc set 0
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 6485 thread 7 bound to OS proc set 3
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 6484 thread 6 bound to OS proc set 1
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 6482 thread 5 bound to OS proc set 2
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 3346 thread 8 bound to OS proc set 0
2021-07-13 17:39:22.998743: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 5762 thread 9 bound to OS proc set 2
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 7314 thread 10 bound to OS proc set 1
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 7315 thread 11 bound to OS proc set 3
OMP: Info #254: KMP_AFFINITY: pid 3346 tid 7316 thread 12 bound to OS proc set 0
Traceback (most recent call last):
  File "/opt/deepmd-kit-2.0.0.b3/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 429, in main
    compress(**dict_args)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/compress.py", line 97, in compress
    train(
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 212, in train
    _do_work(jdata, run_opt)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 262, in _do_work
    model.build(train_data, stop_batch)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/train/trainer.py", line 306, in build
    = self.neighbor_stat.get_stat(data)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/utils/neighbor_stat.py", line 85, in get_stat
    dt = np.min(dt)
  File "<__array_function__ internals>", line 5, in amin
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 2858, in amin
    return _wrapreduction(a, np.minimum, 'min', axis, None, out,
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity
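
For readers unfamiliar with the final message in the traceback above: it is the error NumPy raises when `np.min` is called on an array with no elements, which suggests that `dt` was empty when `get_stat` reached `np.min(dt)`. The snippet below is only a minimal sketch of that NumPy behavior and of one common guard; `safe_min` is a hypothetical helper written for illustration and is not deepmd-kit code or the fix that was eventually merged.

```python
import numpy as np


def safe_min(values, default=float("inf")):
    """Return the minimum of `values`, or `default` when `values` is empty.

    Hypothetical helper for illustration only; not part of deepmd-kit.
    """
    arr = np.asarray(values, dtype=float)
    # `initial` supplies a starting value for the reduction, so an empty
    # input returns `default` instead of raising ValueError.
    return float(np.min(arr, initial=default))


# Reproduce the error from the traceback with an empty array.
empty = np.array([])
try:
    np.min(empty)
    message = ""
except ValueError as err:
    message = str(err)
print(message)

# The guarded version returns the default for empty input
# and the ordinary minimum otherwise.
print(safe_min(empty))
print(safe_min([2.5, 1.0, 3.0]))
```

Whether returning a default is appropriate depends on what the caller does with the value; in some cases skipping the empty input is the better choice.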

nicklin96 commented 3 years ago

I noticed the warning "required batch size is larger than the size of the dataset ../data/data.init/C_atomSpin2/dpmd/set.000 (32 > 1)". Maybe try reducing the batch size to 1?
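
The 32 in this warning matches the `bch_sz` column in the logs, where the two single-atom systems get a batch size of 32 and the 32-, 48-, and 54-atom systems get 1. That pattern is consistent with an automatic rule that picks the smallest batch size whose product with the atom count is at least 32, which is how I understand deepmd-kit's "auto" setting to be documented. The sketch below only illustrates that arithmetic; the function name `auto_batch_size` and the `rule` parameter are assumptions, not deepmd-kit's API.

```python
def auto_batch_size(natoms, rule=32):
    """Smallest batch size such that batch_size * natoms >= rule.

    Hypothetical illustration of the "auto" rule; not deepmd-kit code.
    """
    # Integer ceiling division: ceil(rule / natoms) without floats.
    return -(-rule // natoms)


# Atom counts taken from the DataSystem summary in the logs.
for natoms in (1, 32, 48, 54):
    print(natoms, auto_batch_size(natoms))
```

Under this rule a one-atom system asks for 32 frames per batch, so a set containing a single frame triggers the "(32 > 1)" warning.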

baoqinfu commented 3 years ago

> I noticed the warning: required batch size is larger than the size of the dataset ../data/data.init/C_atomSpin2/dpmd/set.000 (32 > 1) maybe try reducing batch size to 1?

But the batch size was set to "auto" in the input file for dp compress. I set "batch_size" to 1 and ran it again, and a similar error was shown.

......
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0-3
OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #216: KMP_AFFINITY: cpuid leaf 11 not supported.
OMP: Info #216: KMP_AFFINITY: decoding legacy APIC ids.
OMP: Info #157: KMP_AFFINITY: 4 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #192: KMP_AFFINITY: 1 socket x 2 cores/socket x 2 threads/core (2 total cores)
OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 1 maps to socket 0 core 0 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 2 maps to socket 0 core 1 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 3 maps to socket 0 core 1 thread 1
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 20922 thread 0 bound to OS proc set 0
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 21800 thread 1 bound to OS proc set 2
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 21801 thread 2 bound to OS proc set 1
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 21802 thread 3 bound to OS proc set 3
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 20920 thread 4 bound to OS proc set 0
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 21804 thread 5 bound to OS proc set 2
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 21805 thread 6 bound to OS proc set 1
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 21806 thread 7 bound to OS proc set 3
OMP: Info #254: KMP_AFFINITY: pid 16836 tid 16836 thread 8 bound to OS proc set 0
2021-07-13 22:03:17.100625: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Traceback (most recent call last):
  File "/opt/deepmd-kit-2.0.0.b3/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 429, in main
    compress(**dict_args)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/compress.py", line 97, in compress
    train(
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 212, in train
    _do_work(jdata, run_opt)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 262, in _do_work
    model.build(train_data, stop_batch)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/train/trainer.py", line 306, in build
    = self.neighbor_stat.get_stat(data)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/utils/neighbor_stat.py", line 85, in get_stat
    dt = np.min(dt)
  File "<__array_function__ internals>", line 5, in amin
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 2858, in amin
    return _wrapreduction(a, np.minimum, 'min', axis, None, out,
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity

njzjz commented 3 years ago

Can you provide your input file?

baoqinfu commented 3 years ago

> Can you provide your input file?

Below is my input file:


{
    "_comment": "v2.0",
    "model": {
        "type_map": ["Si", "C"],
        "descriptor": {
            "type": "se_e2_a",
            "sel": [300, 300],
            "rcut_smth": 0.5,
            "rcut": 9.0,
            "neuron": [25, 50, 100],
            "resnet_dt": false,
            "axis_neuron": 12,
            "seed": 678568530
        },
        "fitting_net": {
            "neuron": [240, 240, 240],
            "resnet_dt": true,
            "seed": 4111181373
        }
    },
    "loss": {
        "start_pref_e": 0.02,
        "limit_pref_e": 2,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0.01,
        "limit_pref_v": 1
    },
    "learning_rate": {
        "start_lr": 0.001,
        "decay_steps": 20000,
        "_decay_rate": 0.95
    },
    "training": {
        "training_data": {
            "systems": "../data/",
            "batch_size": "auto"
        },
        "validation_data": {
            "systems": "../data/",
            "batch_size": "auto",
            "numb_btch": 4,
            "_comment": "that's all"
        },
        "numb_steps": 4000000,
        "seed": 2061570774,
        "_comment": "that's all",
        "disp_file": "lcurve.out",
        "disp_freq": 2000,
        "numb_test": 1,
        "save_freq": 2000,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json"
    }
}
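
Since the discussion turns on which batch size each data section uses, a quick way to confirm what an input file actually sets is to read it back with Python's `json` module. This is a minimal sketch that assumes the layout of the input file above (a top-level "training" object whose data sections are nested objects carrying a "batch_size" key); `batch_size_settings` is a hypothetical helper, and the inline `example` is a trimmed copy of the posted file rather than the full input.

```python
import json


def batch_size_settings(config):
    """Map each data section under "training" to its "batch_size" value.

    Hypothetical helper for illustration; assumes the layout of the
    input file posted in this thread.
    """
    training = config.get("training", {})
    return {
        name: section["batch_size"]
        for name, section in training.items()
        if isinstance(section, dict) and "batch_size" in section
    }


# Trimmed copy of the "training" block from the posted input file.
example = json.loads("""
{
    "training": {
        "training_data": {"systems": "../data/", "batch_size": "auto"},
        "validation_data": {"systems": "../data/", "batch_size": "auto", "numb_btch": 4},
        "numb_steps": 4000000
    }
}
""")

print(batch_size_settings(example))
```

For a file on disk, the same function can be applied to the result of `json.load` on the opened file.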


njzjz commented 3 years ago

#869 reported the same error.

njzjz commented 3 years ago

Fixed in #882.