PhelanShao opened this issue 1 month ago
I use DeePMD-kit version v3.0.0a1.dev81+g23f67a13 and PyTorch version 2.0.0+cu117. I can't reproduce the same error as yours.
My output is:

(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-05-29 14:43:43,099] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-29 14:43:48,504] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:43:48,504] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
[2024-05-29 14:43:52,597] DEEPMD INFO Adjust batch size from 1024 to 512
/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() sees varying value in profiling, ignoring and this should be handled by GUARD logic (Triggered internally at ../third_party/nvfuser/csrc/parser.cpp:3777.)
  return forward_call(*args, **kwargs)
[2024-05-29 14:43:54,826] DEEPMD INFO Adjust batch size from 512 to 256
[2024-05-29 14:46:47,689] DEEPMD INFO # number of test data : 1404
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE         : 9.018451e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE        : 8.481161e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE/Natoms  : 2.576700e-02 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE/Natoms : 2.423189e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Force  MAE         : 2.593189e-01 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Force  RMSE        : 1.283905e+00 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE         : 3.679828e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE        : 5.281812e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE/Natoms  : 1.051379e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE/Natoms : 1.509089e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:46:47,690] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:46:47,690] DEEPMD INFO # testing system : merged_validation_data/C2O3H4
[2024-05-29 14:47:44,314] DEEPMD INFO # number of test data : 2208
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE         : 6.144467e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE        : 4.617050e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE/Natoms  : 6.827186e-02 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE/Natoms : 5.130056e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Force  MAE         : 3.096173e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Force  RMSE        : 6.542296e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE         : 2.412724e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE        : 3.296499e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE/Natoms  : 2.680805e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE/Natoms : 3.662777e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:47:44,314] DEEPMD INFO # ----------weighted average of errors-----------
[2024-05-29 14:47:44,314] DEEPMD INFO # number of systems : 2
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE         : 7.261597e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE        : 6.402392e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE/Natoms  : 5.175004e-02 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE/Natoms : 4.286043e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Force  MAE         : 2.738023e-01 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Force  RMSE        : 1.138859e+00 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE         : 2.905253e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE        : 4.181720e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE/Natoms  : 2.047440e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE/Natoms : 3.014352e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO # -----------------------------------------------
@Chengqian-Zhang could you find a machine with large memory and set DP_INFER_BATCH_SIZE to 1024? I am wondering whether DP_INFER_BATCH_SIZE=256 is able to reproduce the issue.
@njzjz I used a machine with large memory (GPU memory = 4 * 32 GB, RAM = 368 GB) and set DP_INFER_BATCH_SIZE to 1024, but it still adjusts the batch size from 1024 to 512. How can I solve it?
I removed the part of the code that automatically adjusts the batch size. I find that when DP_INFER_BATCH_SIZE = 512, no error occurs; when DP_INFER_BATCH_SIZE = 1024, CUDA OOM happens, and only a single GPU (32 GB) is used. How can I use four GPUs to run dp test?
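The halving seen in the log ("Adjust batch size from 1024 to 512") can be sketched as below. This is an illustrative reimplementation, not deepmd's actual AutoBatchSize API, and MemoryError stands in for a CUDA OOM error:

```python
# Try the requested batch size; halve it whenever the workload runs out of
# memory, mirroring the "Adjust batch size from 1024 to 512" log messages.
def execute_with_fallback(fn, batch_size, min_batch=1):
    while batch_size >= min_batch:
        try:
            return fn(batch_size), batch_size
        except MemoryError:
            batch_size //= 2  # e.g. 1024 -> 512 -> 256 -> ...
    raise MemoryError("even the minimum batch size runs out of memory")

# Toy workload that only "fits" at batch sizes <= 512, like the report above.
def toy_eval(bs):
    if bs > 512:
        raise MemoryError
    return f"evaluated {bs} frames"

result, used = execute_with_fallback(toy_eval, 1024)
```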
GPU Memory = 4 * 32 GB , Memory = 368GB
Using the CPU.
I used a c52_m384_cpu machine with 384 GB of memory and successfully reproduced this error. No matter what I set DP_INFER_BATCH_SIZE to (anywhere from 1 to 1024), the error always occurs.
(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-06-05 14:30:29,021] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-06-05 14:30:32,283] DEEPMD INFO # ---------------output of dp test---------------
[2024-06-05 14:30:32,283] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
Traceback (most recent call last):
File "/opt/mamba/bin/dp", line 8, in <module>
@Chengqian-Zhang could you comment out the torch jit call, which will give a better traceback?
deepmd-kit/deepmd/pt/infer/deep_eval.py
Line 123 in eb474d4
model = torch.jit.script(model)
@njzjz After I comment out the torch jit call, the error disappears and everything works well.
(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-06-05 15:43:53,693] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-06-05 15:43:54,616] DEEPMD INFO # ---------------output of dp test---------------
[2024-06-05 15:43:54,616] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
[2024-06-05 15:48:18,006] DEEPMD INFO # number of test data : 1404
[2024-06-05 15:48:18,006] DEEPMD INFO Energy MAE         : 9.018451e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Energy RMSE        : 8.481161e+00 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Energy MAE/Natoms  : 2.576700e-02 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Energy RMSE/Natoms : 2.423189e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Force  MAE         : 2.593189e-01 eV/A
[2024-06-05 15:48:18,006] DEEPMD INFO Force  RMSE        : 1.283905e+00 eV/A
[2024-06-05 15:48:18,006] DEEPMD INFO Virial MAE         : 3.679828e+00 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Virial RMSE        : 5.281812e+00 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Virial MAE/Natoms  : 1.051379e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO Virial RMSE/Natoms : 1.509089e-01 eV
[2024-06-05 15:48:18,006] DEEPMD INFO # -----------------------------------------------
[2024-06-05 15:48:18,006] DEEPMD INFO # ---------------output of dp test---------------
[2024-06-05 15:48:18,006] DEEPMD INFO # testing system : merged_validation_data/C2O3H4
[2024-06-05 15:49:38,951] DEEPMD INFO # number of test data : 2208
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE         : 6.144467e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE        : 4.617050e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE/Natoms  : 6.827186e-02 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE/Natoms : 5.130056e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Force  MAE         : 3.096173e-01 eV/A
[2024-06-05 15:49:38,951] DEEPMD INFO Force  RMSE        : 6.542296e-01 eV/A
[2024-06-05 15:49:38,951] DEEPMD INFO Virial MAE         : 2.412724e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Virial RMSE        : 3.296499e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Virial MAE/Natoms  : 2.680805e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Virial RMSE/Natoms : 3.662777e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO # -----------------------------------------------
[2024-06-05 15:49:38,951] DEEPMD INFO # ----------weighted average of errors-----------
[2024-06-05 15:49:38,951] DEEPMD INFO # number of systems : 2
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE         : 7.261597e-01 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE        : 6.402392e+00 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy MAE/Natoms  : 5.175004e-02 eV
[2024-06-05 15:49:38,951] DEEPMD INFO Energy RMSE/Natoms : 4.286043e-01 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Force  MAE         : 2.738023e-01 eV/A
[2024-06-05 15:49:38,952] DEEPMD INFO Force  RMSE        : 1.138859e+00 eV/A
[2024-06-05 15:49:38,952] DEEPMD INFO Virial MAE         : 2.905253e+00 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Virial RMSE        : 4.181720e+00 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Virial MAE/Natoms  : 2.047440e-01 eV
[2024-06-05 15:49:38,952] DEEPMD INFO Virial RMSE/Natoms : 3.014352e-01 eV
[2024-06-05 15:49:38,952] DEEPMD INFO # -----------------------------------------------
Does anything get changed after using torch.jit.script?
It looks like a PyTorch bug. We may first see whether PyTorch 2.3 has fixed it. If not, it will take some time to locate the issue.
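For context, torch.jit.script is expected to be numerically transparent: scripting a module should not change its outputs. A toy sanity check (a trivial module, nothing like the DPA-2 model, which exercises far more of the TorchScript interpreter) looks like this:

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A trivially scriptable forward; real models stress much more of
        # the TorchScript interpreter, where the suspected bug lives.
        return x * 2.0

eager = Toy()
scripted = torch.jit.script(eager)  # the same call as in deep_eval.py
x = torch.arange(4, dtype=torch.float32)
same = torch.equal(eager(x), scripted(x))
```

If the scripted and eager outputs (or, as here, the scripted path's shape handling) diverge on the real model, that points at the interpreter rather than at deepmd's own code.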
Bug summary
I encountered an issue when using "descriptor": "dpa2" to train a model from scratch for 500k steps and then testing the model on a merged validation dataset. The merged validation dataset contains 7290 frames of data from two sources: C2O29H4_1124 and C2O3H4_6166.
When using dp test, an error occurred: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1. It appears to be related to the size of the validation dataset, as adding the -n parameter with a smaller value allows the test to run successfully.
DeePMD-kit Version
v3.0.0a1.dev81+g23f67a13
Backend and its version
PyTorch 2.0.0, CUDA cu117
How did you download the software?
Others (write below)
Input Files, Running Commands, Error Log, etc.
Command:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/

Error Log:
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 100 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:04:24,536] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:04:26,714] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,443] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,444] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:04:28,444] DEEPMD INFO # testing system : /share/20240508/C2O29H4
Traceback (most recent call last):
File "/opt/mamba/bin/dp", line 8, in <module>
sys.exit(main())
File "/opt/mamba/lib/python3.10/site-packages/deepmd/main.py", line 807, in main
deepmd_main(args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 64, in main
test(dict_args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 147, in test
err = test_ener(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 337, in test_ener
ret = dp.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 158, in eval
results = self.deep_eval.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 267, in eval
out = self._eval_func(self._eval_model, numb_test, natoms)(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 339, in eval_func
return self.auto_batch_size.execute_all(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 83, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 111, in execute
raise e
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 108, in execute
n_batch, result = callable(max(batch_nframes, 1), start_index)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 59, in execute_with_batch_size
return (end_index - start_index), callable(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 409, in _eval_model
batch_output = model(
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/train/wrapper.py", line 173, in forward
model_pred = self.model[task_key](input_dict)
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 10 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:05:40,841] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:05:43,028] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,759] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE to control the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,760] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:05:44,760] DEEPMD INFO # testing system : /share/20240508/C2O29H4
[2024-05-11 22:05:51,862] DEEPMD INFO # number of test data : 10
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE         : 1.556307e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE        : 2.110725e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE/Natoms  : 4.446592e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE/Natoms : 6.030642e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Force  MAE         : 5.084302e-01 eV/A
[2024-05-11 22:05:51,862] DEEPMD INFO Force  RMSE        : 7.582858e-01 eV/A
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE         : 6.835131e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE        : 9.148349e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE/Natoms  : 1.952894e-01 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE/Natoms : 2.613814e-01 eV
[2024-05-11 22:05:51,911] DEEPMD INFO # -----------------------------------------------
When setting the dataset to a single validation set (before merging), I encountered the same error. Modifying the command to dp --pt test -m model.ckpt.pt -s /share/20240508/validation_data/ -n 100 worked.
Returning to the merged dataset with the command dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 100, I encountered the same error: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1.
Changing -n to 10 worked: dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 10
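The failing operation is a shape clash along dimension 1. As a purely hypothetical illustration (the actual tensors inside the model were not identified in this thread), mixing arrays padded to different widths, such as 17 vs 25, fails the same way in plain NumPy:

```python
import numpy as np

# Two batches padded to different widths along axis 1 (17 vs 25), loosely
# analogous to two systems producing differently sized per-frame tensors.
a = np.zeros((4, 17))
b = np.zeros((4, 25))

msg = ""
try:
    _ = a + b  # shapes (4, 17) and (4, 25) cannot be broadcast together
except ValueError as e:
    msg = str(e)  # NumPy words it as "operands could not be broadcast ..."
```

PyTorch reports the same class of failure as "The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1".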
Steps to Reproduce
1. Train the model with "descriptor": "dpa2" from scratch for 500k steps, then cp model.ckpt.pt.
2. Merge multiple validation datasets into one dataset named merged_validation_data (7290 frames; C2O29H4_1124, C2O3H4_6166).
3. Run the test command: dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
Further Information, Files, and Links
registry.dp.tech/dptech/prod-157/deepmd-kit:202Q1
model_validation_data.zip
model: https://drive.google.com/file/d/1lVAJFZBnBr2rb-aevxR_nLdPBp_mZZFB/view?usp=drive_link
dpa2_input: input.json