deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

Unable to finetune a model containing descriptor "se_atten_v2" with --init-frz-model argument #3950

Closed RA-zoL closed 1 week ago

RA-zoL commented 1 week ago

Bug summary

When I finetune a previous model with the --init-frz-model argument, I get the error: AssertionError: currently, only support weight matrix and bias matrix at the tabulation op!. The previous model was obtained via dp freeze and uses the se_atten_v2 descriptor. When the previous model does not use the se_atten_v2 descriptor, --init-frz-model works.

I also tried starting the finetuning from the checkpoint of the previous model with the --init-model argument. The training ran to completion, but the model showed little improvement on the new dataset.

DeePMD-kit Version

2.2.7 cuda11.6 gpu

Backend and its version

Tensorflow v2.9.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Input file:

"model": {
    "type_map": ["H", "C", "N", "O"],
    "descriptor": {
        "type": "se_atten_v2",
        "sel": 32,
        "rcut_smth": 0.5,
        "rcut": 6.0,
        "neuron": [30, 60, 120],
        "resnet_dt": true,
        "axis_neuron": 16,
        "attn": 128,
        "attn_layer": 3,
        "attn_mask": false,
        "attn_dotr": true,
        "_stripped_type_embedding": true,
        "_smooth_type_embdding": true,
        "set_davg_zero": false,
        "seed": 1754569613
    },

Running command: dp train --skip-neighbor-stat --init-frz-model test.pb test_job.json

Error log:

DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO training without frame parameter
Traceback (most recent call last):
  File "/users/lguoao/deepmd/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
    deepmd_main(args)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
    train_dp(**dict_args)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
    _do_work(jdata, run_opt, is_compress)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 280, in _do_work
    model.build(train_data, stop_batch, origin_type_map=origin_type_map)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 284, in build
    self._init_from_frz_model()
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 1075, in _init_from_frz_model
    self.model.init_variables(graph, graph_def, model_type=self.model_type)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/model/ener.py", line 369, in init_variables
    self.descrpt.init_variables(graph, graph_def, suffix=suffix)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/descriptor/se_atten.py", line 1293, in init_variables
    super().init_variables(graph=graph, graph_def=graph_def, suffix=suffix)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/descriptor/se_a.py", line 1277, in init_variables
    super().init_variables(graph=graph, graph_def=graph_def, suffix=suffix)
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/descriptor/se.py", line 130, in init_variables
    self.embedding_net_variables = get_embedding_net_variables_from_graph_def(
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/utils/graph.py", line 223, in get_embedding_net_variables_from_graph_def
    embedding_net_nodes = get_embedding_net_nodes_from_graph_def(
  File "/users/lguoao/deepmd/lib/python3.10/site-packages/deepmd/utils/graph.py", line 181, in get_embedding_net_nodes_from_graph_def
    key.find("bias") > 0 or key.find("matrix") > 0
AssertionError: currently, only support weight matrix and bias matrix at the tabulation op!
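For readers unfamiliar with the condition quoted in the last traceback frame, here is a minimal standalone sketch of that check in plain Python. It is not the deepmd-kit implementation: the function name check_embedding_keys and all node names below are hypothetical and only illustrate how a key containing neither "bias" nor "matrix" fails the condition. It makes no claim about which node in an se_atten_v2 graph actually triggers the assertion.

```python
def check_embedding_keys(keys):
    """Return the keys that would fail the condition shown in the traceback.

    Hypothetical helper for illustration; not part of deepmd-kit.
    """
    rejected = []
    for key in keys:
        # str.find returns -1 when the substring is absent and 0 when it
        # sits at the very start, so "> 0" only accepts matches past index 0.
        if not (key.find("bias") > 0 or key.find("matrix") > 0):
            rejected.append(key)
    return rejected


# Hypothetical node names, made up for this example.
example_keys = [
    "filter_type_all/matrix_1",
    "filter_type_all/bias_1",
    "filter_type_all/some_other_tensor",
]

print(check_embedding_keys(example_keys))
```

Any key returned by this sketch corresponds to a node for which the quoted condition evaluates to false, which is the situation the AssertionError reports.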

Steps to Reproduce

  1. Train a model that uses the "se_atten_v2" descriptor and freeze it with dp freeze -o test.pb;
  2. Start another training job with the --init-frz-model test.pb argument; the error above is raised.
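The steps above can be written out as a command sequence. The sketch below only builds and prints the commands rather than running them. The freeze and second train commands are taken from this report; the input file name for the first training run, first_job.json, is a placeholder assumed for illustration, as is the helper name reproduction_commands.

```python
import shlex


def reproduction_commands(first_input="first_job.json",  # placeholder name
                          frozen_model="test.pb",
                          second_input="test_job.json"):
    """Build the three dp invocations described in the steps to reproduce."""
    return [
        # Step 1a: train a model whose descriptor type is se_atten_v2.
        ["dp", "train", first_input],
        # Step 1b: freeze the trained model.
        ["dp", "freeze", "-o", frozen_model],
        # Step 2: start a new training job initialized from the frozen model.
        ["dp", "train", "--skip-neighbor-stat",
         "--init-frz-model", frozen_model, second_input],
    ]


for cmd in reproduction_commands():
    print(shlex.join(cmd))
```

To actually run the sequence, each list could be passed to subprocess.run with check=True, assuming dp is on the PATH and the input files exist.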

Further Information, Files, and Links

No response