PhelanShao opened this issue 1 month ago
I use DeePMD-kit version v3.0.0a1.dev81+g23f67a13 and PyTorch version 2.0.0+cu117. I can't reproduce the same error as yours.
My output is:

(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-05-29 14:43:43,099] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-29 14:43:48,504] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:43:48,504] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
[2024-05-29 14:43:52,597] DEEPMD INFO Adjust batch size from 1024 to 512
/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() sees varying value in profiling, ignoring and this should be handled by GUARD logic (Triggered internally at ../third_party/nvfuser/csrc/parser.cpp:3777.)
  return forward_call(*args, **kwargs)
[2024-05-29 14:43:54,826] DEEPMD INFO Adjust batch size from 512 to 256
[2024-05-29 14:46:47,689] DEEPMD INFO # number of test data : 1404
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE         : 9.018451e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE        : 8.481161e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE/Natoms  : 2.576700e-02 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE/Natoms : 2.423189e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Force  MAE         : 2.593189e-01 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Force  RMSE        : 1.283905e+00 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE         : 3.679828e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE        : 5.281812e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE/Natoms  : 1.051379e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE/Natoms : 1.509089e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:46:47,690] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:46:47,690] DEEPMD INFO # testing system : merged_validation_data/C2O3H4
[2024-05-29 14:47:44,314] DEEPMD INFO # number of test data : 2208
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE         : 6.144467e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE        : 4.617050e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE/Natoms  : 6.827186e-02 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE/Natoms : 5.130056e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Force  MAE         : 3.096173e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Force  RMSE        : 6.542296e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE         : 2.412724e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE        : 3.296499e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE/Natoms  : 2.680805e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE/Natoms : 3.662777e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:47:44,314] DEEPMD INFO # ----------weighted average of errors-----------
[2024-05-29 14:47:44,314] DEEPMD INFO # number of systems : 2
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE         : 7.261597e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE        : 6.402392e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE/Natoms  : 5.175004e-02 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE/Natoms : 4.286043e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Force  MAE         : 2.738023e-01 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Force  RMSE        : 1.138859e+00 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE         : 2.905253e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE        : 4.181720e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE/Natoms  : 2.047440e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE/Natoms : 3.014352e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO # -----------------------------------------------
@Chengqian-Zhang could you find a machine with large memory and set DP_INFER_BATCH_SIZE to 1024? I am wondering whether DP_INFER_BATCH_SIZE=256 is able to reproduce the issue.
@njzjz I used a machine with large memory (GPU memory = 4 * 32 GB, RAM = 368 GB) and set DP_INFER_BATCH_SIZE to 1024, but it still adjusts the batch size from 1024 to 512. How can I solve it?
I removed the part of the code that automatically adjusts the batch size. I find that when DP_INFER_BATCH_SIZE = 512, no error occurs; when DP_INFER_BATCH_SIZE = 1024, CUDA OOM happens, and only a single GPU (32 GB) is used. How can I use four GPUs to run dp test?
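The halving seen in the log ("Adjust batch size from 1024 to 512") can be sketched as below. This is an illustrative reimplementation, not deepmd's actual AutoBatchSize API, and MemoryError stands in for a CUDA OOM error:

```python
# Try the requested batch size; halve it whenever the workload runs out of
# memory, mirroring the "Adjust batch size from 1024 to 512" log messages.
def execute_with_fallback(fn, batch_size, min_batch=1):
    while batch_size >= min_batch:
        try:
            return fn(batch_size), batch_size
        except MemoryError:
            batch_size //= 2  # e.g. 1024 -> 512 -> 256 -> ...
    raise MemoryError("even the minimum batch size runs out of memory")

# Toy workload that only "fits" at batch sizes <= 512, like the report above.
def toy_eval(bs):
    if bs > 512:
        raise MemoryError
    return f"evaluated {bs} frames"

result, used = execute_with_fallback(toy_eval, 1024)
```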
GPU Memory = 4 * 32 GB , Memory = 368GB
Using the CPU.
I used a c52_m384_cpu machine with 384 GB of memory and successfully reproduced this error. No matter what I set DP_INFER_BATCH_SIZE to (anywhere from 1 to 1024), the error always occurs.
(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-06-05 14:30:29,021] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-06-05 14:30:32,283] DEEPMD INFO # ---------------output of dp test---------------
[2024-06-05 14:30:32,283] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
Traceback (most recent call last):
File "/opt/mamba/bin/dp", line 8, in <module>
@Chengqian-Zhang could you comment out the torch jit call, which will give a better traceback?
deepmd-kit/deepmd/pt/infer/deep_eval.py
Line 123 in eb474d4
model = torch.jit.script(model)
@njzjz After I comment out the torch jit call, the error disappears and everything works well.
(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-06-05 15:43:53,693] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-06-05 15:43:54,616] DEEPMD INFO # ---------------output of dp test---------------
[2024-06-05 15:43:54,616] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
[2024-06-05 15:48:18,006] DEEPMD INFO # number of test data : 1404
[2024-06-05 15:48:18,006] DEEPMD INFO Energy MAE         : 9.018451e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Energy RMSE        : 8.481161e+00 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Energy MAE/Natoms  : 2.576700e-02 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Energy RMSE/Natoms : 2.423189e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Force  MAE         : 2.593189e-01 eV/A
[2024-06-05 15:48:18,006] DEEPMD INFO Force  RMSE        : 1.283905e+00 eV/A
[2024-06-05 15:48:18,006] DEEPMD INFO Virial MAE         : 3.679828e+00 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Virial RMSE        : 5.281812e+00 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Virial MAE/Natoms  : 1.051379e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Virial RMSE/Natoms : 1.509089e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO # -----------------------------------------------
[2024-06-05 15:48:18,006] DEEPMD INFO # ---------------output of dp test---------------
[2024-06-05 15:48:18,006] DEEPMD INFO # testing system : merged_validation_data/C2O3H4
[2024-06-05 15:49:38,951] DEEPMD INFO # number of test data : 2208
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE         : 6.144467e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE        : 4.617050e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE/Natoms  : 6.827186e-02 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE/Natoms : 5.130056e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Force  MAE         : 3.096173e-01 eV/A
[2024-06-05 15:49:38,951] DEEPMD INFO Force  RMSE        : 6.542296e-01 eV/A
[2024-06-05 15:49:38,951] DEEPMD INFO Virial MAE         : 2.412724e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Virial RMSE        : 3.296499e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Virial MAE/Natoms  : 2.680805e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Virial RMSE/Natoms : 3.662777e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO # -----------------------------------------------
[2024-06-05 15:49:38,951] DEEPMD INFO # ----------weighted average of errors-----------
[2024-06-05 15:49:38,951] DEEPMD INFO # number of systems : 2
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE         : 7.261597e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE        : 6.402392e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE/Natoms  : 5.175004e-02 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE/Natoms : 4.286043e-01 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Force  MAE         : 2.738023e-01 eV/A
[2024-06-05 15:49:38,952] DEEPMD INFO Force  RMSE        : 1.138859e+00 eV/A
[2024-06-05 15:49:38,952] DEEPMD INFO Virial MAE         : 2.905253e+00 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Virial RMSE        : 4.181720e+00 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Virial MAE/Natoms  : 2.047440e-01 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Virial RMSE/Natoms : 3.014352e-01 eV
[2024-06-05 15:49:38,952] DEEPMD INFO # -----------------------------------------------
Does anything get changed after using torch.jit.script?
It looks like a PyTorch bug. We may first see whether PyTorch 2.3 has fixed it. If not, it will take some time to locate the issue.
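For context, torch.jit.script is expected to be numerically transparent: scripting a module should not change its outputs. A toy sanity check (a trivial module, nothing like the DPA-2 model, which exercises far more of the TorchScript interpreter) looks like this:

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A trivially scriptable forward; real models stress much more of
        # the TorchScript interpreter, where the suspected bug lives.
        return x * 2.0

eager = Toy()
scripted = torch.jit.script(eager)  # the same call as in deep_eval.py
x = torch.arange(4, dtype=torch.float32)
same = torch.equal(eager(x), scripted(x))
```

If the scripted and eager outputs (or, as here, the scripted path's shape handling) diverge on the real model, that points at the interpreter rather than at deepmd's own code.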
Bug summary
I encountered an issue when using "descriptor": "dpa2" to train a model from scratch for 500k steps and then testing the model on a merged validation dataset. The merged validation dataset contains 7290 frames of data from two sources: C2O29H4_1124 and C2O3H4_6166.
When using dp test, an error occurred: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1. It appears to be related to the size of the validation dataset, as adding the -n parameter with a smaller value allows the test to run successfully.
DeePMD-kit Version
v3.0.0a1.dev81+g23f67a13
Backend and its version
PyTorch 2.0.0, CUDA cu117
How did you download the software?
Others (write below)
Input Files, Running Commands, Error Log, etc.
Command:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/

Error Log:
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 100 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:04:24,536] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:04:26,714] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,443] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,444] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:04:28,444] DEEPMD INFO # testing system : /share/20240508/C2O29H4
Traceback (most recent call last):
File "/opt/mamba/bin/dp", line 8, in <module>
sys.exit(main())
File "/opt/mamba/lib/python3.10/site-packages/deepmd/main.py", line 807, in main
deepmd_main(args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 64, in main
test(dict_args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 147, in test
err = test_ener(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 337, in test_ener
ret = dp.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 158, in eval
results = self.deep_eval.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 267, in eval
out = self._eval_func(self._eval_model, numb_test, natoms)(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 339, in eval_func
return self.auto_batch_size.execute_all(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 83, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 111, in execute
raise e
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 108, in execute
n_batch, result = callable(max(batch_nframes, 1), start_index)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 59, in execute_with_batch_size
return (end_index - start_index), callable(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 409, in _eval_model
batch_output = model(
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/train/wrapper.py", line 173, in forward
model_pred = self.model[task_key](input_dict)
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 10 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:05:40,841] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:05:43,028] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,759] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,760] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:05:44,760] DEEPMD INFO # testing system : /share/20240508/C2O29H4
[2024-05-11 22:05:51,862] DEEPMD INFO # number of test data : 10
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE         : 1.556307e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE        : 2.110725e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE/Natoms  : 4.446592e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE/Natoms : 6.030642e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Force  MAE         : 5.084302e-01 eV/A
[2024-05-11 22:05:51,862] DEEPMD INFO Force  RMSE        : 7.582858e-01 eV/A
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE         : 6.835131e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE        : 9.148349e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE/Natoms  : 1.952894e-01 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE/Natoms : 2.613814e-01 eV
[2024-05-11 22:05:51,911] DEEPMD INFO # -----------------------------------------------
When setting the dataset to a single validation set (before merging), I encountered the same error. Modifying the command to dp --pt test -m model.ckpt.pt -s /share/20240508/validation_data/ -n 100 worked.
Returning to the merged dataset with the command dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 100, I encountered the same error: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1.
Changing -n to 10 worked: dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 10
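The failing operation is a shape clash along dimension 1. As a purely hypothetical illustration (the actual tensors inside the model were not identified in this thread), mixing arrays padded to different widths, such as 17 vs 25, fails the same way in plain NumPy:

```python
import numpy as np

# Two batches padded to different widths along axis 1 (17 vs 25), loosely
# analogous to two systems producing differently sized per-frame tensors.
a = np.zeros((4, 17))
b = np.zeros((4, 25))

msg = ""
try:
    _ = a + b  # shapes (4, 17) and (4, 25) cannot be broadcast together
except ValueError as e:
    msg = str(e)  # NumPy words it as "operands could not be broadcast ..."
```

PyTorch reports the same class of failure as "The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1".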
Steps to Reproduce
1. Train the model with "descriptor": "dpa2" from scratch for 500k steps, then cp model.ckpt.pt.
2. Merge multiple validation datasets into one dataset named merged_validation_data (7290 frames; C2O29H4_1124, C2O3H4_6166).
3. Run the test command: dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
Further Information, Files, and Links
registry.dp.tech/dptech/prod-157/deepmd-kit:202Q1
model_validation_data.zip
model: https://drive.google.com/file/d/1lVAJFZBnBr2rb-aevxR_nLdPBp_mZZFB/view?usp=drive_link
dpa2_input: input.json