deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] `RuntimeError` when doing `dp --pt change-bias` #4036

Closed njzjz closed 1 month ago

njzjz commented 1 month ago

Discussed in https://github.com/deepmodeling/deepmd-kit/discussions/4033

Originally posted by **BianTieyuan** July 29, 2024

Hi developers,

Glad to see the release of the new version of DPA-2 and deepmd-kit. I tried to perform zero-shot testing following the steps in https://www.aissquare.com/models/detail?pageType=models&name=DPA-2.2.0-v3.0.0b3&id=272, but an error occurred. Here are my steps:

1. Install deepmd-kit v3.0.0b3 via the offline package.
2. Run 0-step training for the GST_GAP_22 dataset using `dp --pt change-bias OpenLAM_2.2.0_27heads_beta3.pt -s GST_GAP_22/ --model-branch Domains_SemiCond`.
3. The error occurs:

```bash
(base) [polyucmp@localhost DPA-2-2024Q2]$ dp --pt change-bias OpenLAM_2.2.0_27heads_beta3.pt -s GST_GAP_22 --model-branch Domains_SemiCond
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-07-30 11:17:30,839] DEEPMD INFO DeePMD version: 3.0.0b3
[2024-07-30 11:17:32,824] DEEPMD INFO Changing out bias for model Domains_SemiCond.
[2024-07-30 11:17:34,832] DEEPMD INFO Packing data for statistics from 89 systems
[2024-07-30 11:17:38,850] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[2024-07-30 11:17:39,318] DEEPMD INFO Adjust batch size from 1024 to 2048
[2024-07-30 11:17:39,960] DEEPMD INFO Adjust batch size from 2048 to 4096
[2024-07-30 11:17:42,621] DEEPMD INFO Adjust batch size from 4096 to 8192
[2024-07-30 11:17:44,142] DEEPMD INFO Adjust batch size from 8192 to 16384
[2024-07-30 11:17:53,832] DEEPMD INFO Adjust batch size from 16384 to 8192
Traceback (most recent call last):
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/bin/dp", line 10, in
    sys.exit(main())
             ^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/main.py", line 923, in main
    deepmd_main(args)
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/entrypoints/main.py", line 575, in main
    change_bias(FLAGS)
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/entrypoints/main.py", line 511, in change_bias
    updated_model = training.model_change_out_bias(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/train/training.py", line 1277, in model_change_out_bias
    _model.change_out_bias(
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/model/model/make_model.py", line 203, in change_out_bias
    self.atomic_model.change_out_bias(
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 453, in change_out_bias
    delta_bias, out_std = compute_output_stats(
                          ^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/utils/stat.py", line 282, in compute_output_stats
    model_pred = _compute_model_predict(sampled, keys, model_forward)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/utils/stat.py", line 173, in _compute_model_predict
    sample_predict = model_forward_auto_batch_size(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/utils/stat.py", line 165, in model_forward_auto_batch_size
    return auto_batch_size.execute_all(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/utils/auto_batch_size.py", line 153, in execute_all
    r_list = [concate_result(r) for r in zip(*results)]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/utils/auto_batch_size.py", line 153, in
    r_list = [concate_result(r) for r in zip(*results)]
              ^^^^^^^^^^^^^^^^^
  File "/run/media/polyucmp/hdd1/BIAN_Tieyuan/software/deepmd-v3.0.0b3/build/lib/python3.11/site-packages/deepmd/pt/utils/auto_batch_size.py", line 149, in concate_result
    raise RuntimeError(f"Unexpected result type {type(r[0])}")
RuntimeError: Unexpected result type
```

Does this version of deepmd-kit have bugs?

njzjz commented 1 month ago

Reproduced as below:

cd examples/water/se_atten_compressible
dp --pt train input.json
dp --pt change-bias -s ../data/ model.ckpt.pt

Here `returned_dict` is set to `False` when `n_batch` is 0 and `result` is `None`. FYI @wanghan-iapcm

njzjz commented 1 month ago

> when `n_batch` is 0 and `result` is `None`

Note: this only happens with an OOM error
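The two observations above can be illustrated with a simplified, self-contained sketch. This is not the actual code in `auto_batch_size.py`: the toy `execute_all` below only borrows its name from the traceback, and `make_executor` and the `skip_failed` flag are hypothetical names introduced for illustration. It assumes that a failed (OOM-like) attempt reports `n_batch == 0` with `result is None`, and that the "does the callable return a dict?" flag is decided from the first result seen.

```python
def make_executor(batch=2):
    """Hypothetical executor: the first call simulates an OOM retry."""
    state = {"failed_once": False}

    def execute(index):
        if not state["failed_once"]:
            state["failed_once"] = True
            return 0, None  # simulated OOM: nothing processed, no result
        return batch, {"energy": [float(index + i) for i in range(batch)]}

    return execute


def execute_all(execute, total_size, skip_failed):
    """Toy batching loop; `execute(index)` returns (n_processed, result)."""
    index = 0
    collected = []
    returned_dict = None  # decided once, from the first inspected result
    while index < total_size:
        n_batch, result = execute(index)
        if skip_failed and n_batch == 0:
            continue  # failed attempt: retry without inspecting `result`
        if returned_dict is None:
            # Without the guard above, a failed first attempt (result is
            # None) latches this flag to False for every later batch.
            returned_dict = isinstance(result, dict)
        if not returned_dict:
            result = result if isinstance(result, tuple) else (result,)
        index += n_batch
        if n_batch:
            collected.append(result)
    if returned_dict:
        return {k: [x for r in collected for x in r[k]] for k in collected[0]}
    merged = []
    for column in zip(*collected):
        if not isinstance(column[0], list):
            raise RuntimeError(f"Unexpected result type {type(column[0])}")
        merged.append([x for part in column for x in part])
    return merged


for skip_failed in (False, True):
    try:
        print(execute_all(make_executor(), 4, skip_failed))
    except RuntimeError as err:
        print("failed:", err)
```

In this sketch, the run with `skip_failed=False` wraps each later dict in a tuple and fails in the merge step with an "Unexpected result type" error, while the run with `skip_failed=True` ignores the empty attempt and merges the dict results normally. Skipping empty attempts is only one possible guard; the actual fix in deepmd-kit may differ.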

njzjz commented 1 month ago

Related PR: #3626