deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

Error when running LAMMPS in the devel branch #4161

Closed: wujing81 closed this issue 1 month ago

wujing81 commented 2 months ago

Summary

I created a container node from the image registry.dp.tech/dptech/deepmd-kit:3.0.0b3-cuda12.1 on the Bohrium platform. Then I installed the devel branch of DeePMD-kit with:

    conda create -n deepmd-dev python=3.10
    source activate deepmd-dev
    pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
    rsync -a --ignore-existing /opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/ /opt/deepmd-kit-3.0.0b3/

The command /opt/deepmd-kit-3.0.0b3/bin/dp --version reports DeePMD-kit v3.0.0b4.dev56+g0b72dae3. I trained a model using this version of dp (the training input file is attached) and froze it with dp --pt freeze to obtain a .pth file. I then used this model to run MD simulations with /opt/deepmd-kit-3.0.0b3/bin/lmp -i lammps.in (the input.lammps and conf.lmp files are attached). The run fails with the following error:

[bohrium-11849-1195151:01982] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
DeePMD-kit: Successfully load libcudart.so.11.0
2024-09-24 15:37:29.837816: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 15:37:29.837871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 15:37:29.837882: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Loaded 1 plugins from /opt/deepmd-kit-3.0.0b3/lib/deepmd_lmp
Reading data file ...
  triclinic box = (0 0 0) to (12.4447 12.4447 12.4447) with tilt (0 0 0)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.003 seconds
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Summary of lammps deepmd module ...

Info of deepmd-kit:
  installed to:       /opt/deepmd-kit-3.0.0b3
  source:
  source branch:      HEAD
  source commit:      cbf2de6
  source commit at:   2024-07-27 05:11:58 +0000
  support model ver.: 1.1
  build variant:      cuda
  build with tf inc:  /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/include;/opt/deepmd-kit-3.0.0b3/include
  build with tf lib:  /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2
  build with pt lib:  torch;torch_library;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10.so;/usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
Info of lammps module:
  use deepmd-kit at:  /opt/deepmd-kit-3.0.0b3
load model from: model.pth to cpu
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Info of model(s):
  using 1 model(s):   model.pth
  rcut in model:      4.5
  ntypes in model:    118

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 10 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6.5
  ghost atom cutoff = 6.5
  binsize = 3.25, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend JIT error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 56, in forward_lower
    comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
    _5 = (self).need_sorted_nlist_for_lower()
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )

    _6 = (self).get_fitting_net()
    model_predict = annotate(Dict[str, Tensor], {})
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 213, in forward_common_lower
    cc_ext, _36, fp, ap, input_prec, = _35
    atomic_model = self.atomic_model
    atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _37 = (self).atomic_output_def()
    training = self.training
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 50, in forward_common_atomic
    ext_atom_mask = (self).make_atom_mask(extended_atype, )
    _3 = torch.where(ext_atom_mask, extended_atype, 0)
    ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
    _4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 93, in forward_atomic
      pass
    descriptor = self.descriptor
    _16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    descriptor0, rot_mat, g2, h2, sw, = _16
    fitting_net = self.fitting_net
  File "code/__torch__/deepmd/pt/model/descriptor/dpa2.py", line 98, in forward
    repformers1 = self.repformers
    _17 = nlist_dict[_1(_16, (repformers1).get_nsel(), )]
    _18 = (repformers).forward(_17, extended_coord, extended_atype, g13, mapping0, comm_dict0, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    g14, g2, h2, rot_mat, sw, = _18
    concat_output_tebd = self.concat_output_tebd
  File "code/__torch__/deepmd/pt/model/descriptor/repformers.py", line 364, in forward
  _65 = "border_op is not available since customized PyTorch OP library is not built when freezing the model."
  _66 = uninitialized(Tensor)
  ops.prim.RaiseException(_65, "builtins.NotImplementedError")
  ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  return _66

Traceback of TorchScript, original code (most recent call last):
  File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/ener_model.py", line 109, in forward_lower
        comm_dict: Optional[Dict[str, torch.Tensor]] = None,
    ):
        model_ret = self.forward_common_lower(
                    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/make_model.py", line 261, in forward_common_lower
            )
            del extended_coord, fparam, aparam
            atomic_ret = self.atomic_model.forward_common_atomic(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                cc_ext,
                extended_atype,
  File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 242, in forward_common_atomic

        ext_atom_mask = self.make_atom_mask(extended_atype)
        ret_dict = self.forward_atomic(
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            torch.where(ext_atom_mask, extended_atype, 0),
  File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 189, in forward_atomic
        if self.do_grad_r() or self.do_grad_c():
            extended_coord.requires_grad_(True)
        descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                          ~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 799, in forward
            g1 = g1_ext
        # repformer
        g1, g2, h2, rot_mat, sw = self.repformers(
                                  ~~~~~~~~~~~~~~~ <--- HERE
            nlist_dict[
                get_multiple_nlist_key(
  File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/repformers.py", line 62, in forward
        argument8,
    ) -> torch.Tensor:
        raise NotImplementedError(
        "border_op is not available since customized PyTorch OP library is not built when freezing the model."
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    )

builtins.NotImplementedError: border_op is not available since customized PyTorch OP library is not built when freezing the model. (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run 1000

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
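
For readers landing here from the same traceback: the frame at repformers.py line 62 above is a Python fallback that gets scripted into the frozen .pth whenever the compiled extension is missing at freeze time. A minimal sketch of that stub pattern, as an illustration rather than the actual deepmd source:

    import torch

    # Illustrative stand-in for the compiled border_op (not the real deepmd
    # code): when the customized PyTorch OP library has not been built, a
    # placeholder like this is what TorchScript serializes into the frozen
    # model. Its raise becomes the ops.prim.RaiseException call visible in
    # the serialized traceback above, so any LAMMPS run that reaches the
    # repformer layer fails at step 0.
    def border_op(*args: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError(
            "border_op is not available since customized PyTorch OP library "
            "is not built when freezing the model."
        )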

DeePMD-kit Version

DeePMD-kit v3.0.0b4.dev56+g0b72dae3

Backend and its version

PyTorch v2.4.1+cu121-g38b96d3399a

Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

No response

Details

input.zip

iProzd commented 2 months ago

@wujing81 Apologies for the confusion during installation; I faced the same issue while debugging.

The problem arises because DPA-2 requires the border_op custom operator, which is only built when PyTorch support is enabled during installation. You can enable it with the following command:

    DP_VARIANT=cuda DP_ENABLE_PYTORCH=1 pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
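After reinstalling, it may be worth confirming that the custom OPs are actually present before re-freezing. A minimal check, assuming the devel tree exposes an ENABLE_CUSTOMIZED_OP flag in deepmd.pt.cxx_op (treat the module path and flag name as assumptions and adjust to your checkout):

    # Hedged check: ENABLE_CUSTOMIZED_OP is assumed to be the flag the PT
    # backend consults before falling back to the raising border_op stub.
    # If this prints False, freezing will reproduce the error above.
    from deepmd.pt.cxx_op import ENABLE_CUSTOMIZED_OP

    print("customized PyTorch OP library loaded:", ENABLE_CUSTOMIZED_OP)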

But why is this option disabled by default? @njzjz @CaRoLZhangxy To my understanding, users who want to use the DPA-2 model with LAMMPS must enable it. BTW, the documentation of this option here may not be clear enough: https://docs.deepmodeling.com/projects/deepmd/en/latest/install/install-from-source.html#envvar-DP_ENABLE_PYTORCH

njzjz commented 2 months ago

But why is this option disabled by default?

xref: https://github.com/deepmodeling/deepmd-kit/pull/3891#issuecomment-2181707561

I am not going to change the default option to True until PyTorch fixes https://github.com/pytorch/pytorch/issues/78530.