[BUG] `dp test` is `Killed` on a single A100 machine when the batch size is only 10.

AnguseZhang commented 1 month ago

Bug summary

I prepared a system in mixed-type format that contains 413 frames with 45 atoms each.

If I run `dp test` directly, the process is killed. If I specify the batch size (`-n 5` in the log below), it finishes normally. But if the batch size increases to 10 (`-n 10`), the process is killed again. For such a small system (45 atoms), it is strange that inference with the DPA-2 model fails. Please have a look.

I'm not sure if this problem is related to https://github.com/deepmodeling/deepmd-kit/issues/3766.

(deepmd-pytorch) /home/data/zhangyz/20240520_dptest_debug> dp --pt test -m /home/data/zhangyz/20240520_dptest_debug/model.ckpt.pt -s /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45 -n 5
/opt/deepmd-kit-3.0.0/envs/deepmd-pytorch/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-20 07:58:47,895] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-20 07:58:49,590] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:58:50,602] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:58:50,603] DEEPMD INFO    # ---------------output of dp test--------------- 
[2024-05-20 07:58:50,603] DEEPMD INFO    # testing system : /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45
[2024-05-20 07:59:13,302] DEEPMD INFO    # number of test data : 5 
[2024-05-20 07:59:13,302] DEEPMD INFO    Energy MAE         : 8.019694e+00 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    Energy RMSE        : 8.844138e+00 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    Energy MAE/Natoms  : 1.782154e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    Energy RMSE/Natoms : 1.965364e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    Force  MAE         : 2.158802e-01 eV/A
[2024-05-20 07:59:13,302] DEEPMD INFO    Force  RMSE        : 2.993930e-01 eV/A
[2024-05-20 07:59:13,302] DEEPMD INFO    Virial MAE         : 1.979567e+01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    Virial RMSE        : 3.396110e+01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    Virial MAE/Natoms  : 4.399037e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    Virial RMSE/Natoms : 7.546912e-01 eV
[2024-05-20 07:59:13,302] DEEPMD INFO    # ----------------------------------------------- 
(deepmd-pytorch) /home/data/zhangyz/20240520_dptest_debug> dp --pt test -m /home/data/zhangyz/20240520_dptest_debug/model.ckpt.pt -s /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45 -n 10
/opt/deepmd-kit-3.0.0/envs/deepmd-pytorch/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-20 07:59:30,896] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-20 07:59:32,592] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:59:33,598] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-20 07:59:33,598] DEEPMD INFO    # ---------------output of dp test--------------- 
[2024-05-20 07:59:33,598] DEEPMD INFO    # testing system : /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45
Killed
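
The warning in the log above says that `DP_INFER_BATCH_SIZE` counts `nframes * natoms` and defaults to 1024. As a rough sketch of what that implies for this 45-atom system, the arithmetic below assumes the value is simply divided by the atom count to get the number of frames per batch. That division rule and the helper name `frames_per_batch` are assumptions made for illustration, not something confirmed in this thread.

```python
def frames_per_batch(budget: int, natoms: int) -> int:
    """Frames that fit in one batch if `budget` caps nframes * natoms.

    Hypothetical helper: assumes plain integer division and at least
    one frame per batch.
    """
    if budget <= 0 or natoms <= 0:
        raise ValueError("budget and natoms must be positive")
    return max(1, budget // natoms)


DEFAULT_BUDGET = 1024  # default value quoted in the DEEPMD warning
NATOMS = 45            # atoms per frame in the reported system

# Under the assumption above, the default covers 22 frames at once,
# so a 10-frame test would be evaluated in a single batch.
default_frames = frames_per_batch(DEFAULT_BUDGET, NATOMS)

# A budget sized for 5 frames per batch, to export before running dp test.
small_budget = 5 * NATOMS
print(f"frames per batch at default: {default_frames}")
print(f"export DP_INFER_BATCH_SIZE={small_budget}")
```

If the assumption holds, lowering the variable would reduce peak memory per batch while still testing all requested frames.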

DeePMD-kit Version

23f67a139dc63db87328d8dc9aeb7fa3f0f39049 (2024Q1)

Backend and its version

PyTorch 2.2.1+cu121

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

See steps to reproduce

Steps to Reproduce

On a PAI machine, run:

dp --pt test -m /home/data/zhangyz/20240520_dptest_debug/model.ckpt.pt -s /home/data/guomingyu/DP-CLEAN/aissq/HEA25_S/valid/45 -n 10

Further Information, Files, and Links

No response

AnguseZhang commented 1 month ago

I found that the previous machine had too little memory (12G). After I increased the memory, it works.
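
For anyone hitting the same `Killed` message, a quick way to see how much RAM a machine has before running the test is sketched below. It uses only the Python standard library and assumes a Linux-like system. The helper name `total_ram_gib` is hypothetical, and inside a container the value may reflect the host rather than the container's memory limit.

```python
import os


def total_ram_gib():
    """Return total physical RAM in GiB, or None if it cannot be queried.

    Relies on os.sysconf, which is available on Linux and most Unix
    systems but not on Windows.
    """
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        n_pages = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    if page_size <= 0 or n_pages <= 0:
        return None
    return page_size * n_pages / 1024**3


ram = total_ram_gib()
if ram is None:
    print("Could not determine total RAM on this platform.")
else:
    print(f"Total RAM: {ram:.1f} GiB")
```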

njzjz commented 1 month ago

It seems you are using the CPU instead of the A100.
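
The CPU fallback is visible in the log posted above: PyTorch prints a `CUDA initialization` warning, and the run ends with a bare `Killed` line, which on Linux commonly means the process was terminated for using too much memory. A minimal sketch of scanning a saved log for these two signs follows. The function name `diagnose_log` and the wording of the findings are hypothetical, and the check is a plain text match, not part of DeePMD-kit.

```python
def diagnose_log(log_text):
    """Return a list of human-readable findings from a dp test log."""
    findings = []
    # PyTorch emits this warning when it cannot initialize the CUDA driver.
    if "CUDA initialization" in log_text and "failed" in log_text:
        findings.append("CUDA initialization failed: likely running on CPU")
    # The shell prints a bare "Killed" line when the process gets SIGKILL.
    if any(line.strip() == "Killed" for line in log_text.splitlines()):
        findings.append("process was killed: check available memory")
    return findings


sample = (
    "UserWarning: CUDA initialization: CUDA driver initialization failed, "
    "you might not have a CUDA gpu.\n"
    "DEEPMD INFO    # testing system : /path/to/system\n"
    "Killed\n"
)
for finding in diagnose_log(sample):
    print(finding)
```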
