deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] Model converted from PT to TF backend could not run with TF #3997

Closed by Cloudac7 1 month ago

Cloudac7 commented 1 month ago

Bug summary

I am working on multi-task training with DeePMD-kit v3.0.0b0, and after the freezing step I get a head with the se_a descriptor. I then ran dp --pt convert-backend frozen_model.pth frozen_model.pb (omitting --pt gives the same result) to get a frozen_model.pb. However, the converted model cannot be used to run LAMMPS with either v2.2.9 or v3.0.0b0; it raises the following error:

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
INVALID_ARGUMENT: 2 root error(s) found.
  (0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
     [[{{node Reshape_33}}]]
     [[o_atom_energy/_37]]
  (1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
     [[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored.
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: TensorFlow Error: INVALID_ARGUMENT: 2 root error(s) found.
  (0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
     [[{{node Reshape_33}}]]
     [[o_atom_energy/_37]]
  (1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
     [[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored. (/public/groups/ai4ec/libs/conda/deepmd/3.0.0b0-cuda118/source/deepmd-kit/source/lmp/pair_deepmd.cpp:586)
Last command: run             ${NSTEPS} upto

Something seems to go wrong when converting the model, which looks like a bug.
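For readers unfamiliar with this kind of TensorFlow error, the arithmetic behind the message can be checked directly. The sketch below only reuses the two numbers from the log (504000 and 1608); the helper name can_reshape is made up for illustration and is not part of DeePMD-kit or TensorFlow.

```python
# Numbers copied from the error message in the log above.
N_VALUES = 504000   # number of values in the tensor passed to Reshape_33
ROW_WIDTH = 1608    # the requested shape needs a multiple of this


def can_reshape(n_values: int, row_width: int) -> bool:
    """Hypothetical helper: a flat tensor can be reshaped to (-1, row_width)
    only if its size is an exact multiple of row_width."""
    return n_values % row_width == 0


remainder = N_VALUES % ROW_WIDTH
print(f"504000 mod 1608 = {remainder}")            # non-zero remainder
print(f"reshape possible: {can_reshape(N_VALUES, ROW_WIDTH)}")
```

Since the remainder is non-zero, the width that the graph expects per row does not match the width of the data it actually receives, which points at a dimension bookkeeping mismatch rather than at the LAMMPS input.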

DeePMD-kit Version

DeePMD-kit v3.0.0b0

Backend and its version

PyTorch v2.0.0.post200, TensorFlow v2.14.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Running command:

dp --pt freeze -o frozen_model.pth --head ener
dp convert-backend frozen_model.pth frozen_model.pb

Adding --pt to the convert-backend command gives the same result.

The LAMMPS error log is attached below: slurm-2623892.txt

Steps to Reproduce

Please convert the following frozen_model.pth and run the following LAMMPS task to reproduce the bug.

Further Information, Files, and Links

No response

njzjz commented 1 month ago

DescrptDPA1Compat has the wrong get_dim_out() when concat_output_tebd is true. cc @iProzd
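To illustrate how a wrong get_dim_out() can surface as the reshape error above, here is a minimal toy sketch. It is not the DeePMD-kit implementation and not the change made in the fix: the class ToyDescriptor, its fields, and the numbers are all hypothetical, and it simply assumes that the concatenated type embedding widens the per-atom output while the reported dimension ignores it.

```python
class ToyDescriptor:
    """Hypothetical stand-in for a descriptor; NOT DescrptDPA1Compat."""

    def __init__(self, base_dim: int, tebd_dim: int, concat_output_tebd: bool):
        self.base_dim = base_dim                  # width of the descriptor itself
        self.tebd_dim = tebd_dim                  # width of the type embedding
        self.concat_output_tebd = concat_output_tebd

    def actual_width(self) -> int:
        """Per-atom width of the data that is really produced."""
        extra = self.tebd_dim if self.concat_output_tebd else 0
        return self.base_dim + extra

    def get_dim_out_buggy(self) -> int:
        """Assumed bug: ignores the concatenated type embedding."""
        return self.base_dim

    def get_dim_out_consistent(self) -> int:
        """Reported width agrees with what is actually produced."""
        return self.actual_width()


# Made-up sizes, chosen only to make the mismatch visible.
descrpt = ToyDescriptor(base_dim=16, tebd_dim=8, concat_output_tebd=True)
n_atoms = 5
n_values = n_atoms * descrpt.actual_width()       # flat output size

# A downstream reshape to (-1, dim_out) needs n_values % dim_out == 0.
buggy_remainder = n_values % descrpt.get_dim_out_buggy()
consistent_remainder = n_values % descrpt.get_dim_out_consistent()
print(buggy_remainder, consistent_remainder)
```

With the inconsistent width the flat output is not a multiple of the reported dimension, which is the same shape of failure as "requires a multiple of 1608" in the log; with the consistent width the reshape divides evenly.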

njzjz commented 1 month ago

Fixed in #4007.