Vibsteamer commented 3 years ago

Summary

Using 2.0 beta3 to convert a version-1.3 model and then compress it together with the original training script input.json (manually revised to pass the "strict" check in 2.0) raises a dim mismatch between the old (converted) graph and the raw graph.

Deepmd-kit version, installation way, input file, running commands, error log, etc.

For training the version-1.3 model (the image name on ALI is kit-1.3.0, though the source reports 1.2.2):

source :              v1.2.2-96-g15948fa
source brach:         HEAD
source commit:        15948fa
source commit at:     2021-01-01 17:17:33 +0800
build float prec:     double

input file (the part related to net size):

{
    "model": {
        "descriptor": {
            "type": "se_a",
            "sel": [
                300,
                300,
                300
            ],
            "rcut_smth": 2.0,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "type_one_side": true,
            "seed": 2687978781
        },
        "fitting_net": {
            "n_neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "coord_norm": true,
            "type_fitting_net": false,
            "seed": 2706322555
        },
...

for convert

source :              v2.0.0.b3
source brach:         HEAD
source commit:        de428e3
source commit at:     2021-07-04 22:12:13 +0800
build float prec:     double

Convert commands (both were tried; both lead to the same error message at the compress step):

dp convert-from -i frozen_model.pb -o graph.000.new.pb '1.3'

dp convert-from -i frozen_model.pb -o graph.000.new.pb '1.2'

for compress:

source :              v2.0.0.b3
source brach:         HEAD
source commit:        de428e3
source commit at:     2021-07-04 22:12:13 +0800
build float prec:     double

command:

dp compress -i graph.000.new.pb -o graph.000.new.compress.pb input.json

Part of the input.json (it had to be manually revised from the original above to satisfy the "strict" format check in 2.0):

{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": [
                300,
                300,
                300
            ],
            "rcut_smth": 2.0,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "type_one_side": true,
            "seed": 2687978781
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "_coord_norm": true,
            "_type_fitting_net": false,
            "seed": 2706322555
        },
...
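
The manual revision between the two snippets above can be sketched in Python. This is only an illustration of the edits visible in this report (se_a renamed to se_e2_a, n_neuron renamed to neuron, and the two unrecognized fitting_net keys prefixed with an underscore). The function name revise_for_strict_check is hypothetical and is not part of DeePMD-kit.

```python
import copy


def revise_for_strict_check(old_cfg):
    """Return a revised copy of a 1.3-style config (hypothetical helper).

    Only the edits visible in this report are applied; a real input file
    may need further changes to pass the 2.0 check.
    """
    cfg = copy.deepcopy(old_cfg)

    descriptor = cfg["model"]["descriptor"]
    if descriptor.get("type") == "se_a":
        descriptor["type"] = "se_e2_a"

    fitting = cfg["model"]["fitting_net"]
    if "n_neuron" in fitting:
        # The old key name is moved to the name used in the revised file.
        fitting["neuron"] = fitting.pop("n_neuron")
    for key in ("coord_norm", "type_fitting_net"):
        if key in fitting:
            # Prefixed with "_" exactly as in the revised snippet.
            fitting["_" + key] = fitting.pop(key)
    return cfg


old = {
    "model": {
        "descriptor": {"type": "se_a", "axis_neuron": 12},
        "fitting_net": {
            "n_neuron": [240, 240, 240],
            "resnet_dt": True,
            "coord_norm": True,
            "type_fitting_net": False,
        },
    }
}
new = revise_for_strict_check(old)
print(new["model"]["fitting_net"])
```

Note that renaming n_neuron to neuron is not a neutral change if the version that trained the model did not honor n_neuron: the revised file then describes a different network from the one that was actually trained.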

Error message when compressing:

DEEPMD INFO    stage 3: transfer the model
DEEPMD INFO    792 ops in the raw graph
DEEPMD INFO    1110 ops in the old graph
DEEPMD INFO    descrpt_attr/t_avg is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    descrpt_attr/t_std is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
Traceback (most recent call last):
  File "/opt/deepmd-kit-2.0.0.b3/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 429, in main
    compress(**dict_args)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/compress.py", line 115, in compress
    transfer(old_model=input, raw_model=output, output=output)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/transfer.py", line 72, in transfer
    new_graph_def = transform_graph(raw_graph, old_graph)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/transfer.py", line 127, in transform_graph
    check_dim(raw_graph_node, old_graph_node, node.name)
  File "/opt/deepmd-kit-2.0.0.b3/lib/python3.9/site-packages/deepmd/entrypoints/transfer.py", line 215, in check_dim
    raise RuntimeError(
RuntimeError: old graph dim {
  size: 1200
}
dim {
  size: 120
}
 and raw graph dim {
  size: 1200
}
dim {
  size: 240
}
has different layer_0_type_0/matrix dim

platform: ALI

Steps to Reproduce

1) Train a model for several steps with the original input and kit-1.3.
2) Convert it to new_model using kit-2.0 beta3.
3) Compress new_model using the manually_revised_for_strict_check input.json and kit-2.0 beta3.

NOTE: Training a model for several steps with the manually_revised_for_strict_check input.json and kit-2.0 beta3, and then compressing it, works fine.
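
For reference, the reproduction steps can be written down as a command sequence. The sketch below only builds and prints the commands; it does not run them. The subcommands and flags are the ones that appear in this thread, the freeze step is implied by frozen_model.pb rather than listed in the steps, and the default file names are assumptions for illustration.

```python
import shlex


def reproduction_commands(old_input="input_1.3.json", new_input="input.json"):
    """Return (environment, argv) pairs for the reproduction steps.

    File names are illustrative defaults; adjust them to the actual files.
    """
    return [
        ("kit-1.3", ["dp", "train", old_input]),
        ("kit-1.3", ["dp", "freeze", "-o", "frozen_model.pb"]),
        ("kit-2.0 beta3", ["dp", "convert-from", "1.3",
                           "-i", "frozen_model.pb", "-o", "graph.000.new.pb"]),
        ("kit-2.0 beta3", ["dp", "compress", new_input,
                           "-i", "graph.000.new.pb",
                           "-o", "graph.000.new.compress.pb"]),
    ]


for env, argv in reproduction_commands():
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(f"[{env}] {shlex.join(argv)}")
```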

Further Information, Files, and Links

input.json (1.3)
input.json (manually revised from 1.3 for the strict check in 2.0)
frozen_model.pb (1.3)
graph.000.new.pb (converted)
files.zip

URL for data (when reproducing, please match the directory structure to the paths written in the two input.json files): http://dplibrary.deepmd.net/#/project_details?project_id=202010.002

WARNING: Please keep only a small amount of training data when reproducing (delete most of it and revise input.json accordingly). Otherwise the complete data sets will consume lots of your life when initializing the training, especially on rotating disks.

denghuilu commented 3 years ago

In fact, it's a misuse of the training script. After I changed the parameter model[fitting_net][neuron] from 240 to 120 in input.json, all problems were solved. Please check the parameter and rerun the program.
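
The two shapes in the traceback can be reproduced with a little arithmetic. The sketch below assumes that the descriptor output width is the last embedding width times axis_neuron (100 * 12 = 1200, which matches the first dimension in the error) and that the second dimension of layer_0_type_0/matrix is the width of the first fitting layer. The helper name is hypothetical.

```python
def first_fitting_layer_shape(descriptor, fitting_neuron):
    """Expected shape of layer_0_type_0/matrix under the stated assumptions."""
    # Assumed descriptor output width: last embedding width * axis_neuron.
    descriptor_dim = descriptor["neuron"][-1] * descriptor["axis_neuron"]
    return (descriptor_dim, fitting_neuron[0])


descriptor = {"neuron": [25, 50, 100], "axis_neuron": 12}

# Old (converted) graph: first fitting layer of width 120.
old_shape = first_fitting_layer_shape(descriptor, [120, 120, 120])
# Raw graph built from the revised input.json: width 240.
raw_shape = first_fitting_layer_shape(descriptor, [240, 240, 240])

print(old_shape, raw_shape)  # (1200, 120) (1200, 240)
```

Setting the fitting net width to 120 makes the raw graph match the old graph, which is why the compression then goes through.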

Vibsteamer commented 3 years ago

Thank you for your reply. But the fitting net has never been changed; it has always been [240, 240, 240].

Can you get a normal result by following the Steps to Reproduce given in the report above, using the same input files and code versions?

denghuilu commented 3 years ago

I can get the correct result through the following steps:

- training by the 1.3 dp:

root ISSUE-846 $ dp -h
WARNING:tensorflow:From /root/dp-master/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
usage: dp [-h] {transform,train,freeze,test,convert-to,doc-train-input} ...

DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics

optional arguments:
  -h, --help            show this help message and exit

Valid subcommands:
  {transform,train,freeze,test,convert-to,doc-train-input}
    transform           pass parameters to another model
    train               train a model
    freeze              freeze the model
    test                test the model
    convert-to          convert dp-1.3 model to higher model compatibility
    doc-train-input     print the documentation (in rst format) of input training parameters.
root ISSUE-846 $ dp train input_1.3.json
WARNING:tensorflow:From /root/dp-master/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term

DEEPMD: [DeePMD-kit ASCII-art banner]
DEEPMD: Please read and cite:
DEEPMD: Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD:
DEEPMD: ---Summary of the training---------------------------------------
DEEPMD: installed to: /tmp/pip-req-build-_1b4jy64/_skbuild/linux-x86_64-3.6/cmake-install
DEEPMD: source : v1.2.2-167-g53b886c
DEEPMD: source brach: master
DEEPMD: source commit: 53b886c
DEEPMD: source commit at: 2021-06-21 14:19:24 +0800
DEEPMD: build float prec: double
DEEPMD: build with tf inc: /root/dp-master/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include
DEEPMD: build with tf lib:
DEEPMD: running on: iZ2zeedzsx4jorjze9gyq7Z
DEEPMD: CUDA_VISIBLE_DEVICES: unset
DEEPMD: num_intra_threads: 0
DEEPMD: num_inter_threads: 0
DEEPMD: -----------------------------------------------------------------
DEEPMD:
DEEPMD: ---Summary of DataSystem------------------------------------------------
DEEPMD: found 3 system(s):
DEEPMD: system                                      natoms  bch_sz  n_bch  n_test  prob
DEEPMD: -- bug-fix-related/ISSUE-846/data/init.997      32       1     10       1  0.333
DEEPMD: -- bug-fix-related/ISSUE-846/data/init.998      32       1     10       1  0.333
DEEPMD: -- bug-fix-related/ISSUE-846/data/init.999      32       1     10       1  0.333
DEEPMD: ------------------------------------------------------------------------
DEEPMD:
DEEPMD: training without frame parameter
DEEPMD: built lr
DEEPMD: built network
DEEPMD: built training
DEEPMD: initialize model from scratch
DEEPMD: start training at lr 1.00e-03 (== 1.00e-03), decay_step 80000, decay_rate 0.944061, final lr will be 1.00e-08
DEEPMD: batch 2000 training time 29.45 s, testing time 0.01 s
DEEPMD: saved checkpoint model.ckpt
DEEPMD: batch 4000 training time 28.40 s, testing time 0.01 s
DEEPMD: saved checkpoint model.ckpt


- freeze the model by the 1.3 dp:

root ISSUE-846 $ dp freeze -o 1.3.pb
WARNING:tensorflow:From /root/dp-master/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
The following nodes will be frozen: o_energy,o_force,o_virial,o_atom_energy,o_atom_virial,descrpt_attr/rcut,descrpt_attr/ntypes,fitting_attr/dfparam,fitting_attr/daparam,model_attr/tmap,model_attr/model_type
WARNING:tensorflow:From /root/dp-master/tensorflow_venv/lib/python3.6/site-packages/deepmd/freeze.py:79: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /root/dp-master/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/framework/convert_to_constants.py:856: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
1109 ops in the final graph.


- convert the model from 1.3 to 2.0 by the 2.0 dp:

root ISSUE-846 $ dp convert-from 1.3 -i 1.3.pb -o 2.0.pb
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/utils/convert.py:24: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
DEEPMD WARNING From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/utils/convert.py:24: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
the converted output model (2.0 support) is saved in 2.0.pb


- and compress the converted model by the 2.0 dp:

root ISSUE-846 $ dp compress out.json -i 2.0.pb -o 2.0-compress.pb
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO
DEEPMD INFO    stage 1: train or refine the model with tabulation
DEEPMD INFO    [DeePMD-kit ASCII-art banner]
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to: /tmp/pip-req-build-cmjfgch8/_skbuild/linux-x86_64-3.6/cmake-install
DEEPMD INFO    source : v1.2.2-770-g12b0bd6
DEEPMD INFO    source brach: devel
DEEPMD INFO    source commit: 12b0bd6
DEEPMD INFO    source commit at: 2021-07-28 10:06:20 +0800
DEEPMD INFO    build float prec: double
DEEPMD INFO    build with tf inc: /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include;/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on: iZ2zeedzsx4jorjze9gyq7Z
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    num_intra_threads: 0
DEEPMD INFO    num_inter_threads: 0
DEEPMD INFO    -----------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: training -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO    system                                      natoms  bch_sz  n_bch   prob  pbc
DEEPMD INFO    -- bug-fix-related/ISSUE-846/data/init.997      32       1     10  0.333    T
DEEPMD INFO    -- bug-fix-related/ISSUE-846/data/init.998      32       1     10  0.333    T
DEEPMD INFO    -- bug-fix-related/ISSUE-846/data/init.999      32       1     10  0.333    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
DEEPMD INFO    training data with min nbor dist: 3.092825818114446
DEEPMD INFO    training data with max nbor size: [21, 12, 26]
DEEPMD INFO    training data with lower boundary: -0.13109273996841408
DEEPMD INFO    training data with upper boundary: 8.947001116142355
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 100, decay_rate 0.562341, final lr will be 1.00e-08
DEEPMD INFO    batch 2000 training time 9.80 s, testing time 0.00 s
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    finished training
DEEPMD INFO    wall time: 12.211 s
DEEPMD INFO
DEEPMD INFO    stage 2: freeze the model
INFO:tensorflow:Restoring parameters from /root/dp-bug-fix-related/ISSUE-846/model.ckpt
DEEPMD INFO    Restoring parameters from /root/dp-bug-fix-related/ISSUE-846/model.ckpt
The following nodes will be frozen: ['descrpt_attr/rcut', 'descrpt_attr/ntypes', 'model_attr/tmap', 'model_attr/model_type', 'model_attr/model_version', 'o_energy', 'o_force', 'o_virial', 'o_atom_energy', 'o_atom_virial', 'fitting_attr/dfparam', 'fitting_attr/daparam']
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/entrypoints/freeze.py:183: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
DEEPMD WARNING From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/entrypoints/freeze.py:183: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/framework/convert_to_constants.py:856: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
DEEPMD WARNING From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/framework/convert_to_constants.py:856: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
792 ops in the final graph.
DEEPMD INFO
DEEPMD INFO    stage 3: transfer the model
DEEPMD INFO    792 ops in the raw graph
DEEPMD INFO    1110 ops in the old graph
DEEPMD INFO    descrpt_attr/t_avg is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    descrpt_attr/t_std is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_0_type_0/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_0_type_0/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_0/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_0/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_0/idt is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_0/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_0/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_0/idt is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    final_layer_type_0/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    final_layer_type_0/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_0_type_1/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_0_type_1/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_1/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_1/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_1/idt is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_1/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_1/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_1/idt is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    final_layer_type_1/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    final_layer_type_1/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_0_type_2/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_0_type_2/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_2/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_2/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_1_type_2/idt is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_2/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_2/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    layer_2_type_2/idt is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    final_layer_type_2/matrix is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    final_layer_type_2/bias is passed from old graph(<class 'numpy.float64'>) to raw graph(<class 'numpy.float64'>)
DEEPMD INFO    the output model is saved in 2.0-compress.pb


- compare the dp-test results of the two models:

root ISSUE-846 $ dp test -s data -m 2.0.pb
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : data/init.998
DEEPMD INFO    # number of test data : 10
DEEPMD INFO    Energy RMSE : 5.110137e-01 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.596918e-02 eV
DEEPMD INFO    Force RMSE : 1.223216e-02 eV/A
DEEPMD INFO    Virial RMSE : 2.932328e+01 eV
DEEPMD INFO    Virial RMSE/Natoms : 9.163526e-01 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : data/init.999
DEEPMD INFO    # number of test data : 10
DEEPMD INFO    Energy RMSE : 3.589722e-01 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.121788e-02 eV
DEEPMD INFO    Force RMSE : 1.296819e-02 eV/A
DEEPMD INFO    Virial RMSE : 2.883325e+01 eV
DEEPMD INFO    Virial RMSE/Natoms : 9.010390e-01 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : data/init.997
DEEPMD INFO    # number of test data : 10
DEEPMD INFO    Energy RMSE : 5.086544e-01 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.589545e-02 eV
DEEPMD INFO    Force RMSE : 1.187290e-02 eV/A
DEEPMD INFO    Virial RMSE : 2.975450e+01 eV
DEEPMD INFO    Virial RMSE/Natoms : 9.298282e-01 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ----------weighted average of errors-----------
DEEPMD INFO    # number of systems : 3
DEEPMD INFO    Energy RMSE/Natoms : 1.453181e-02 eV
DEEPMD INFO    Force RMSE : 1.236616e-02 eV/A
DEEPMD INFO    Virial RMSE/Natoms : 9.158155e-01 eV
DEEPMD INFO    # -----------------------------------------------

root ISSUE-846 $ dp test -s data -m 2.0-compress.pb
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : data/init.998
DEEPMD INFO    # number of test data : 10
DEEPMD INFO    Energy RMSE : 5.110137e-01 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.596918e-02 eV
DEEPMD INFO    Force RMSE : 1.223216e-02 eV/A
DEEPMD INFO    Virial RMSE : 2.932328e+01 eV
DEEPMD INFO    Virial RMSE/Natoms : 9.163526e-01 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : data/init.999
DEEPMD INFO    # number of test data : 10
DEEPMD INFO    Energy RMSE : 3.589722e-01 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.121788e-02 eV
DEEPMD INFO    Force RMSE : 1.296819e-02 eV/A
DEEPMD INFO    Virial RMSE : 2.883325e+01 eV
DEEPMD INFO    Virial RMSE/Natoms : 9.010390e-01 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ---------------output of dp test---------------
DEEPMD INFO    # testing system : data/init.997
DEEPMD INFO    # number of test data : 10
DEEPMD INFO    Energy RMSE : 5.086544e-01 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.589545e-02 eV
DEEPMD INFO    Force RMSE : 1.187290e-02 eV/A
DEEPMD INFO    Virial RMSE : 2.975450e+01 eV
DEEPMD INFO    Virial RMSE/Natoms : 9.298282e-01 eV
DEEPMD INFO    # -----------------------------------------------
DEEPMD INFO    # ----------weighted average of errors-----------
DEEPMD INFO    # number of systems : 3
DEEPMD INFO    Energy RMSE/Natoms : 1.453181e-02 eV
DEEPMD INFO    Force RMSE : 1.236616e-02 eV/A
DEEPMD INFO    Virial RMSE/Natoms : 9.158155e-01 eV
DEEPMD INFO    # -----------------------------------------------
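
Comparing the two dp test outputs by eye is error-prone, so a small script can do it. The sketch below is not part of DeePMD-kit: it parses the "weighted average of errors" block from captured dp test output with a regular expression and checks that the two runs agree. The function names and the tolerance are assumptions, and the sample text is the weighted-average block quoted in this comment.

```python
import math
import re

# Matches lines such as "Energy RMSE/Natoms : 1.453181e-02 eV".
_METRIC = re.compile(r"(Energy|Force|Virial)\s+RMSE(/Natoms)?\s*:\s*([-+0-9.eE]+)")


def weighted_average_metrics(log_text):
    """Return the metrics printed after the 'weighted average of errors' header."""
    _, _, tail = log_text.partition("weighted average of errors")
    return {
        f"{kind} RMSE{per_atom}": float(value)
        for kind, per_atom, value in _METRIC.findall(tail)
    }


def same_metrics(log_a, log_b, rel_tol=1e-6):
    """True if both logs contain the same metrics with matching values."""
    a = weighted_average_metrics(log_a)
    b = weighted_average_metrics(log_b)
    if not a or a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)


sample = """\
DEEPMD INFO    # ----------weighted average of errors-----------
DEEPMD INFO    # number of systems : 3
DEEPMD INFO    Energy RMSE/Natoms : 1.453181e-02 eV
DEEPMD INFO    Force RMSE : 1.236616e-02 eV/A
DEEPMD INFO    Virial RMSE/Natoms : 9.158155e-01 eV
"""

print(weighted_average_metrics(sample))
print(same_metrics(sample, sample))
```

In practice the two arguments would be the captured outputs of the runs on 2.0.pb and 2.0-compress.pb.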

denghuilu commented 3 years ago

Here's the input_1.3.json:

{
    "model": {
        "descriptor": {
            "type": "se_a",
            "sel": [
                300,
                300,
                300
            ],
            "rcut_smth": 2.0,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "type_one_side": true,
            "seed": 2687978781
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "seed": 2706322555
        },
        "type_map": [
            "Mg",
            "Al",
            "Cu"
        ]
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 80000
    },
    "loss": {
        "start_pref_e": 0.02,
        "limit_pref_e": 2,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0.0,
        "limit_pref_v": 0.0
    },
    "training": {
        "systems": [
            "/root/dp-bug-fix-related/ISSUE-846/data/init.997",
            "/root/dp-bug-fix-related/ISSUE-846/data/init.998",
            "/root/dp-bug-fix-related/ISSUE-846/data/init.999"

        ],
        "set_prefix": "set",
        "stop_batch": 16000000,
        "batch_size": "auto",
        "seed": 1520843097,
        "_comment": "that's all",
        "disp_file": "lcurve.out",
        "disp_freq": 2000,
        "numb_test": 1,
        "save_freq": 2000,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json"
    }
}

denghuilu commented 3 years ago

and the out.json:

{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": [
                300,
                300,
                300
            ],
            "rcut_smth": 2.0,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "type_one_side": true,
            "seed": 2687978781,
            "activation_function": "tanh",
            "precision": "float64",
            "trainable": true,
            "exclude_types": [],
            "set_davg_zero": false
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "seed": 2706322555,
            "type": "ener",
            "numb_fparam": 0,
            "numb_aparam": 0,
            "activation_function": "tanh",
            "precision": "float64",
            "trainable": true,
            "rcond": 0.001,
            "atom_ener": []
        },
        "type_map": [
            "Mg",
            "Al",
            "Cu"
        ],
        "data_stat_nbatch": 10,
        "data_stat_protect": 0.01
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 80000,
        "stop_lr": 1e-08
    },
    "loss": {
        "start_pref_e": 0.02,
        "limit_pref_e": 2,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0.0,
        "limit_pref_v": 0.0,
        "type": "ener",
        "start_pref_ae": 0.0,
        "limit_pref_ae": 0.0
    },
    "training": {
        "seed": 1520843097,
        "disp_file": "lcurve.out",
        "disp_freq": 2000,
        "numb_test": 1,
        "save_freq": 2000,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json",
        "training_data": {
            "systems": [
                "/root/dp-bug-fix-related/ISSUE-846/data/init.997",
                "/root/dp-bug-fix-related/ISSUE-846/data/init.998",
                "/root/dp-bug-fix-related/ISSUE-846/data/init.999"
            ],
            "set_prefix": "set",
            "batch_size": "auto",
            "auto_prob": "prob_sys_size",
            "sys_probs": null
        },
        "numb_steps": 16000000,
        "validation_data": null,
        "tensorboard": false,
        "tensorboard_log_dir": "log"
    }
}

njzjz commented 3 years ago

We may need #727 to avoid such things.

njzjz commented 3 years ago

This is a bug (or an unexpected breaking change) in v1.3. model/fitting_net/n_neuron was not added as an alias of model/fitting_net/neuron in dargs, so it falls back to the default value [120,120,120]. @Vibsteamer you may need to check all of your previous models trained with v1.3.
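
The fallback can be illustrated with a toy lookup. This is not the dargs API: resolve is a hypothetical stand-in that only shows how a key that is not registered as an alias gets ignored in favor of the default.

```python
DEFAULT_NEURON = [120, 120, 120]


def resolve(section, name, aliases=(), default=None):
    """Return section[name], else the first alias present, else the default."""
    for key in (name, *aliases):
        if key in section:
            return section[key]
    return default


fitting_net = {"n_neuron": [240, 240, 240], "resnet_dt": True}

# Alias not registered: the user's value is silently ignored.
without_alias = resolve(fitting_net, "neuron", default=DEFAULT_NEURON)
# Alias registered: the user's value is used.
with_alias = resolve(fitting_net, "neuron", aliases=("n_neuron",),
                     default=DEFAULT_NEURON)

print(without_alias)  # [120, 120, 120]
print(with_alias)     # [240, 240, 240]
```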

Vibsteamer commented 3 years ago

Thank you for your kind reply and effort.

and, LOL, I will check those previous models.
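
One way to find affected models is to scan the input files that were used for training. The sketch below is a hypothetical helper, not a DeePMD-kit tool: it flags any JSON file whose fitting_net uses n_neuron without neuron, which according to the explanation above would have been trained with the default width under v1.3.

```python
import json
from pathlib import Path


def uses_unaliased_key(config):
    """True if fitting_net sets n_neuron but not neuron."""
    model = config.get("model") if isinstance(config, dict) else None
    fitting = model.get("fitting_net") if isinstance(model, dict) else None
    return (
        isinstance(fitting, dict)
        and "n_neuron" in fitting
        and "neuron" not in fitting
    )


def find_affected_inputs(root):
    """Return paths of JSON files under root that match the pattern above."""
    affected = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            config = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON; skip it
        if uses_unaliased_key(config):
            affected.append(path)
    return affected
```

Calling find_affected_inputs on the directory that holds the training runs returns the list of suspicious input files. Whether a flagged model was really trained with the default width also depends on the exact build that trained it.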

It seems that using deepmd-kit v1.3 with dpgen may run into a similar issue. Some data sets generated this way may be more or less redundant, especially for complex systems involving multiple elements/phases.

Best,

Time is a big nougat.

deepmodeling / deepmd-kit

[BUG] _convert '1.3' then compress | dim mismatch | 1.3.0 & 2.0beta3_ #846
