deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0
1.5k stars 511 forks

[BUG] An error when MC simulation in lammps #4207

Open Zch102xjtumse opened 1 month ago

Zch102xjtumse commented 1 month ago

Bug summary

Hello everyone. I encountered an error when using a fine-tuned DPA2 model in a LAMMPS MC simulation. The error message is shown below; I don't know what caused it. I'd appreciate any help.

DeePMD-kit Version

DeePMD-kit v3.0.0b4

Backend and its version

PyTorch v2.0.0.post200-gc263bd43e8e

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

ERROR on proc 2: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 56, in forward_lower
    comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
    _5 = (self).need_sorted_nlist_for_lower()
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )
    _6 = (self).get_fitting_net()
    model_predict = annotate(Dict[str, Tensor], {})
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 213, in forward_common_lower
    cc_ext, _36, fp, ap, input_prec, = _35
    atomic_model = self.atomic_model
    atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _37 = (self).atomic_output_def()
    training = self.training
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 50, in forward_common_atomic
    ext_atom_mask = (self).make_atom_mask(extended_atype, )
    _3 = torch.where(ext_atom_mask, extended_atype, 0)
    ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
    _4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 93, in forward_atomic
      pass
    descriptor = self.descriptor
    _16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    descriptor0, rot_mat, g2, h2, sw, = _16
    fitting_net = self.fitting_net
  File "code/__torch__/deepmd/pt/model/descriptor/dpa2.py", line 84, in forward
    repformers3 = self.repformers
    _17 = nlist_dict[_1(_16, (repformers3).get_nsel(), )]
    _18 = (repformers1).forward(_17, extended_coord, extended_atype, g11, mapping0, comm_dict0, )
           ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    g12, g2, h2, rot_mat, sw, = _18
    concat_output_tebd = self.concat_output_tebd
  File "code/__torch__/deepmd/pt/model/descriptor/repformers.py", line 226, in forward
      _32 = torch.tensor(nloc)
      _33 = torch.tensor(torch.sub(nall, nloc))
      ret = ops.deepmd.border_op(_25, _26, _27, _28, _29, g10, _31, _32, _33)
            ~~~~~~~~~~~~~~~~~~~~ <--- HERE
      g1_ext, comm_dict6, mapping6 = torch.unsqueeze(ret[0], 0), comm_dict7, mapping2
    _34 = (_00).forward(g1_ext, g23, h2, nlist0, nlist_mask, sw1, )

Traceback of TorchScript, original code (most recent call last):
  File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/ener_model.py", line 109, in forward_lower
        comm_dict: Optional[Dict[str, torch.Tensor]] = None,
    ):
        model_ret = self.forward_common_lower(
                    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/make_model.py", line 261, in forward_common_lower
            )
            del extended_coord, fparam, aparam
            atomic_ret = self.atomic_model.forward_common_atomic(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                cc_ext,
                extended_atype,
  File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 241, in forward_common_atomic

        ext_atom_mask = self.make_atom_mask(extended_atype)
        ret_dict = self.forward_atomic(
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            torch.where(ext_atom_mask, extended_atype, 0),
  File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 189, in forward_atomic
        if self.do_grad_r() or self.do_grad_c():
            extended_coord.requires_grad_(True)
        descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                          ~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 652, in forward
            g1 = g1_ext
        # repformer
        g1, g2, h2, rot_mat, sw = self.repformers(
                                  ~~~~~~~~~~~~~~~ <--- HERE
            nlist_dict[
                get_multiple_nlist_key(
  File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/descriptor/repformers.py", line 480, in forward
                assert "recv_num" in comm_dict
                assert "communicator" in comm_dict
                ret = torch.ops.deepmd.border_op(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                    comm_dict["send_list"],
                    comm_dict["send_proc"],
RuntimeError: Trying to create tensor with negative dimension -1873441304: [-1873441304]
 (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run 150000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

### Steps to Reproduce

The LAMMPS input file is as follows:
label i
variable i loop 2
variable ts equal 0+300*$i
variable ta equal 0+300*$i
shell mkdir dpav1-${ta}
units        metal
boundary      p p p
atom_style    atomic
timestep      0.001
read_data     min.data
pair_style deepmd ../dpav1.pth
pair_coeff * * x x x
compute            1 all temp
compute            Ek all ke/atom
compute            Ep all pe/atom
compute_modify        1 dynamic yes
thermo_style          custom step dt time temp ke pe etotal press lx ly lz vol
thermo             100
dump       1         all custom 5000 dpav1-${ta}/dumpthermo.atom.* id type x y z c_Ek c_Ep
velocity  all create ${ts} 82765577 rot yes dist gaussian
fix r2 all npt temp ${ta} ${ta} 0.1 iso 0.0 0.0 1.0
fix mc4 all atom/swap 20 5 82765577 ${ts} types 1 2
fix mc5 all atom/swap 20 5 82765577 ${ts} types 1 3
fix mc6 all atom/swap 20 5 82765577 ${ts} types 2 3
run 100000
min_style       cg
minimize        1.0e-6 1.0e-7 10000 10000

clear
next i
jump SELF i
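
For readers unfamiliar with the `fix atom/swap` lines above, here is a minimal Python sketch of the Metropolis test that such swap moves are generally based on, plus a count of how many swap attempts the script schedules per `run`. This is an illustration, not LAMMPS source code: the function names `accept_swap` and `attempts_per_run` are hypothetical, and the Boltzmann constant is given in eV/K on the assumption that `units metal` (eV, K) applies.

```python
import math
import random

# Boltzmann constant in eV/K (LAMMPS "metal" units use eV and K).
KB_EV_PER_K = 8.617333262e-5


def accept_swap(delta_e_ev, temperature_k, rng=random.random):
    """Metropolis test for a trial swap (hypothetical helper).

    delta_e_ev is E_after - E_before in eV. Downhill moves are always
    accepted; uphill moves are accepted with probability exp(-dE / (kB*T)).
    """
    if delta_e_ev <= 0.0:
        return True
    return rng() < math.exp(-delta_e_ev / (KB_EV_PER_K * temperature_k))


def attempts_per_run(run_steps, every_n_steps, attempts_per_call, n_fixes):
    """Total swap attempts scheduled during one `run` (hypothetical helper)."""
    return (run_steps // every_n_steps) * attempts_per_call * n_fixes


if __name__ == "__main__":
    # Values taken from the input script: run 100000, three fixes "atom/swap 20 5".
    print(attempts_per_run(100000, 20, 5, 3))
    print(accept_swap(-0.05, 300.0))
```

Each attempt requires the pair style to re-evaluate the energy after the types are exchanged, so the model is called far more often than in a plain NPT run.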

### Further Information, Files, and Links

_No response_

njzjz commented 1 month ago

How many atoms are there? It looks like an integer overflow bug. Could you provide files to reproduce the bug?
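
To illustrate the integer-overflow hypothesis, here is a small Python sketch showing how a value that does not fit in a signed 32-bit integer wraps around to a negative number like the one in the error message. It only demonstrates the arithmetic; it does not establish what actually happened inside `border_op`, and the helper name `as_int32` is hypothetical.

```python
import ctypes


def as_int32(value):
    """Reinterpret an integer as a signed 32-bit value (wraps on overflow)."""
    return ctypes.c_int32(value).value


if __name__ == "__main__":
    reported = -1873441304           # dimension from the RuntimeError
    unsigned_view = reported + 2**32  # same bit pattern read as unsigned 32-bit
    print(unsigned_view)
    print(as_int32(unsigned_view) == reported)
```

A negative tensor dimension of this magnitude is consistent with either a 32-bit wraparound or an uninitialized integer being read as a size, which is why the atom count is relevant.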

Zch102xjtumse commented 1 month ago

> How many atoms are there? It looks like an integer overflow bug. Could you provide files to reproduce the bug?

There are 108 atoms. Here are the files: file.zip

CaRoLZhangxy commented 2 weeks ago

> How many atoms are there? It looks like an integer overflow bug. Could you provide files to reproduce the bug?
>
> 108. Here are the files: file.zip

I could not reproduce the error with this input on the current devel branch, in either single-process or multi-process execution. The issue may already be fixed on the devel branch.