@wujing81 Apologies for the confusion during installation; I faced the same issue while debugging.
The problem arises because DPA2 requires the border_op custom operator, which is only built when PyTorch support is enabled during installation. You can enable it with the following command:
```sh
DP_VARIANT=cuda DP_ENABLE_PYTORCH=1 pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
```
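Once the reinstall finishes, it is worth confirming that the customized OP library was actually built before freezing and running again. A minimal sanity check, assuming the devel-branch layout where deepmd.pt.cxx_op exposes the ENABLE_CUSTOMIZED_OP flag and registers the operators under the torch.ops.deepmd namespace:

```sh
# Should succeed after a build with DP_ENABLE_PYTORCH=1; the flag is
# False when the customized PyTorch OP library was not built.
python -c "from deepmd.pt.cxx_op import ENABLE_CUSTOMIZED_OP; assert ENABLE_CUSTOMIZED_OP"
# Importing deepmd.pt.cxx_op loads the shared library, which registers
# border_op; this lookup fails if the library is missing.
python -c "import deepmd.pt.cxx_op, torch; print(torch.ops.deepmd.border_op)"
```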
But why is this option False by default? @njzjz @CaRoLZhangxy To my understanding, users who want to use the DPA2 model with LAMMPS need this option. BTW, the documentation of this option here may not be clear enough: https://docs.deepmodeling.com/projects/deepmd/en/latest/install/install-from-source.html#envvar-DP_ENABLE_PYTORCH
> But why is this option False by default?
xref: https://github.com/deepmodeling/deepmd-kit/pull/3891#issuecomment-2181707561
I am not going to change the default option to True until PyTorch fixes https://github.com/pytorch/pytorch/issues/78530.
Summary
I created a container node registry.dp.tech/dptech/deepmd-kit:3.0.0b3-cuda12.1 using the Bohrium platform. Then I installed the devel branch of DeePMD-kit with:
```sh
conda create -n deepmd-dev python=3.10
source activate deepmd-dev
pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
rsync -a --ignore-existing /opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/ /opt/deepmd-kit-3.0.0b3/
```
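Note that this pip command leaves DP_ENABLE_PYTORCH unset, so the customized PyTorch OP library that provides border_op is never compiled into the wheel; this turns out to be the root cause of the error below. A quick post-install check would have surfaced it before any training time was spent (a sketch, assuming the devel branch exposes the ENABLE_CUSTOMIZED_OP flag in deepmd.pt.cxx_op):

```sh
# Prints False for this installation: the customized PyTorch OP library
# (and with it border_op) was not built.
python -c "from deepmd.pt.cxx_op import ENABLE_CUSTOMIZED_OP; print(ENABLE_CUSTOMIZED_OP)"
```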
The command /opt/deepmd-kit-3.0.0b3/bin/dp --version displays DeePMD-kit v3.0.0b4.dev56+g0b72dae3. I trained a model using this version of dp; the training input file is attached. I used dp --pt freeze to get a .pth file. Then I used this model to run MD simulations with the command /opt/deepmd-kit-3.0.0b3/bin/lmp -i lammps.in. The input.lammps and conf.lmp files are attached. An error occurs:

```
[bohrium-11849-1195151:01982] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
DeePMD-kit: Successfully load libcudart.so.11.0
2024-09-24 15:37:29.837816: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 15:37:29.837871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 15:37:29.837882: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Loaded 1 plugins from /opt/deepmd-kit-3.0.0b3/lib/deepmd_lmp
Reading data file ...
  triclinic box = (0 0 0) to (12.4447 12.4447 12.4447) with tilt (0 0 0)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.003 seconds
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Summary of lammps deepmd module ...
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 10 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 6.5
  ghost atom cutoff = 6.5
  binsize = 3.25, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style : metal
  Current step : 0
  Time step : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend JIT error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/torch/deepmd/pt/model/model/ener_model.py", line 56, in forward_lower
    comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
    _5 = (self).need_sorted_nlist_for_lower()
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )
builtins.NotImplementedError: border_op is not available since customized PyTorch OP library is not built when freezing the model. (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run 1000
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
```
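The availability check appears to be baked in at freeze time (the serialized TorchScript above contains the NotImplementedError branch rather than a call to the real operator), so the training checkpoint itself is unaffected. After reinstalling with DP_ENABLE_PYTORCH=1 as described above, re-freezing the existing checkpoint should be enough; no retraining is needed. A sketch, run from the original training directory:

```sh
# Re-freeze with the rebuilt installation; the output name is arbitrary.
dp --pt freeze -o frozen_model.pth
```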
DeePMD-kit Version
DeePMD-kit v3.0.0b4.dev56+g0b72dae3
Backend and its version
PyTorch v2.4.1+cu121-g38b96d3399a
Python Version, CUDA Version, GCC Version, LAMMPS Version, etc
No response
Details
input.zip