I prepared a system in mixed-type format that contains 413 frames, each with 45 atoms.
If I run dp test directly, the process is killed. If I limit the number of test frames with -n 5, it finishes normally, but with -n 10 it is killed again. For such a small system (45 atoms), it is strange that inference with the DPA-2 model fails. Please have a look.
(deepmd-pytorch) /home/data/zhangyz/20240520_dptest_debug> dp --pt test -m /home/data/zhangyz/20240520_dptest_debug/model.ckpt.pt -s /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45 -n 5
/opt/deepmd-kit-3.0.0/envs/deepmd-pytorch/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
[2024-05-20 07:58:47,895] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-20 07:58:49,590] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:58:50,602] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:58:50,603] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-20 07:58:50,603] DEEPMD INFO # testing system : /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45
[2024-05-20 07:59:13,302] DEEPMD INFO # number of test data : 5
[2024-05-20 07:59:13,302] DEEPMD INFO Energy MAE : 8.019694e+00 eV
[2024-05-20 07:59:13,302] DEEPMD INFO Energy RMSE : 8.844138e+00 eV
[2024-05-20 07:59:13,302] DEEPMD INFO Energy MAE/Natoms : 1.782154e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO Energy RMSE/Natoms : 1.965364e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO Force MAE : 2.158802e-01 eV/A
[2024-05-20 07:59:13,302] DEEPMD INFO Force RMSE : 2.993930e-01 eV/A
[2024-05-20 07:59:13,302] DEEPMD INFO Virial MAE : 1.979567e+01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO Virial RMSE : 3.396110e+01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO Virial MAE/Natoms : 4.399037e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO Virial RMSE/Natoms : 7.546912e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO # -----------------------------------------------
(deepmd-pytorch) /home/data/zhangyz/20240520_dptest_debug> dp --pt test -m /home/data/zhangyz/20240520_dptest_debug/model.ckpt.pt -s /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45 -n 10
/opt/deepmd-kit-3.0.0/envs/deepmd-pytorch/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
[2024-05-20 07:59:30,896] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-20 07:59:32,592] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:59:33,598] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:59:33,598] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-20 07:59:33,598] DEEPMD INFO # testing system : /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45
Killed
DeePMD-kit Version
23f67a139dc63db87328d8dc9aeb7fa3f0f39049 (2024Q1)
Backend and its version
PyTorch 2.2.1+cu121
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
See steps to reproduce
Steps to Reproduce
On a PAI machine, run:
dp --pt test -m /home/data/zhangyz/20240520_dptest_debug/model.ckpt.pt -s /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45 -n 10
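The warning in the log above says DP_INFER_BATCH_SIZE is counted as nframes * natoms, with a default of 1024. Below is a rough sketch of what that could mean for this 45-atom system. It assumes the number of frames per inference call is simply the batch size divided by the atom count, which I have not checked against the DeePMD-kit source. The helper names are hypothetical and are not part of DeePMD-kit.

```python
# Sketch only: assumes frames per inference call = batch_size // natoms.
# This is a guess based on the "(nframes * natoms)" wording in the log warning,
# not the confirmed DeePMD-kit implementation.

def frames_per_batch(natoms: int, batch_size: int) -> int:
    """Hypothetical helper: frames evaluated in one inference call."""
    return max(batch_size // natoms, 1)


def batch_size_for_frames(natoms: int, max_frames: int) -> int:
    """Hypothetical helper: batch size that caps a call at max_frames frames."""
    return natoms * max_frames


NATOMS = 45                # atoms per frame in this system
DEFAULT_BATCH_SIZE = 1024  # default reported in the log warning

# Frames per call under the default batch size, given the assumption above.
print(frames_per_batch(NATOMS, DEFAULT_BATCH_SIZE))

# Candidate DP_INFER_BATCH_SIZE that would cap each call at 5 frames,
# the frame count that finished normally with -n 5.
print(batch_size_for_frames(NATOMS, 5))
```

If this assumption holds, the default would allow all 10 frames of the -n 10 run in a single call, so exporting DP_INFER_BATCH_SIZE with the second printed value before running dp test may be worth trying as a workaround.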
I'm not sure whether this problem is related to https://github.com/deepmodeling/deepmd-kit/issues/3766.
Further Information, Files, and Links
No response