Bug summary
When running a DPA-2 (dp_tf) training calculation with DeePMD-kit v3.0.0a0, we always get the following warnings:
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
This suggests that data loading takes up a large share of the training time, which makes my training very slow.
I would like to know whether there is a way to fix this problem.
By the way, the keyword "stat_file": "./dpa2 was used.
DeePMD-kit Version
V3.0.0a0
Backend and its version
deepmd-kit.3.0_cuda123/lib/python3.11
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:19:32,806] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:30:11,391] DEEPMD INFO batch 24000: trn: rmse = 1.21e+01, rmse_e = 1.30e-01, rmse_f = 4.21e-01, lr = 1.65e-04
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: val: rmse = 9.12e+00, rmse_e = 1.42e-01, rmse_f = 3.18e-01
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
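To put a number on the slowdown, the log above can be reduced to seconds per batch. The following is a minimal sketch, not part of DeePMD-kit: the helper name `seconds_per_batch` is hypothetical, and it assumes that "total wall time" covers only the interval since the previous report. That assumption is inferred from the timestamps (10:59:59 to 11:15:47 is about 948 s, matching the 948.59 s reported at batch 22000).

```python
import re

# Three wall-time lines copied from the log above.
LOG = """\
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
"""

# Captures the batch index and the reported wall time in seconds.
WALL_TIME = re.compile(r"batch\s+(\d+): total wall time = ([\d.]+) s")


def seconds_per_batch(log_text):
    """Return (batch, seconds per batch) for each interval between reports.

    Assumption: each reported wall time spans only the batches since the
    previous report, so it is divided by the difference in batch index.
    """
    records = [(int(b), float(t)) for b, t in WALL_TIME.findall(log_text)]
    rates = []
    for (prev_batch, _), (batch, wall) in zip(records, records[1:]):
        rates.append((batch, wall / (batch - prev_batch)))
    return rates


if __name__ == "__main__":
    for batch, rate in seconds_per_batch(LOG):
        print(f"up to batch {batch}: {rate:.3f} s/batch")
```

For the excerpt above this gives roughly 0.47 s and 0.43 s per batch for the two intervals. The first report is skipped because the batch index of the report before it is not in the excerpt.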
{
  "_comment": "that's all",
  "model": {
    "type_map": ["H", "C", "N", "O"],
    "descriptor": {
      "type": "dpa2",
      "tebd_dim": 8,
      "repinit_rcut": 7.0,
      "repinit_rcut_smth": 6.0,
      "repinit_nsel": 100,
      "repformer_rcut": 4.0,
      "repformer_rcut_smth": 3.5,
      "repformer_nsel": 40,
      "repinit_neuron": [25, 50, 100],
      "repinit_axis_neuron": 12,
      "repinit_activation": "tanh",
      "repformer_nlayers": 12,
      "repformer_g1_dim": 128,
      "repformer_g2_dim": 32,
      "repformer_attn2_hidden": 32,
      "repformer_attn2_nhead": 4,
      "repformer_attn1_hidden": 128,
      "repformer_attn1_nhead": 4,
      "repformer_axis_dim": 4,
      "repformer_update_h2": false,
      "repformer_update_g1_has_conv": true,
      "repformer_update_g1_has_grrg": true,
      "repformer_update_g1_has_drrd": true,
      "repformer_update_g1_has_attn": true,
      "repformer_update_g2_has_g1g1": true,
      "repformer_update_g2_has_attn": true,
      "repformer_attn2_has_gate": true,
      "repformer_add_type_ebd_to_seq": false
    },
Steps to Reproduce
mpirun -np 4 dp --pt train --skip-neighbor-stat --mpi-log=master input.json
Further Information, Files, and Links
No response