deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] Training does not converge with the se_atten_v2 descriptor #3103

Open h840473807 opened 7 months ago

h840473807 commented 7 months ago

Bug summary

When I use the "se_atten_v2" descriptor to train a model (I have successfully trained the same system with se_e2_a), I found that training randomly fails to converge (only in 00.train/002). With the same input file (all parameters identical, including the seeds), I ran the task 5 times and it converged only twice. The output lcurve.out is shown in the figures below. In Figure 1 (not convergent), rmse_f_trn oscillates around 1.0 (too high), while in Figure 2 (convergent), rmse_f_trn drops rapidly below 0.1 (the same as with se_e2_a). Are there any bugs or problems with my input file?

[Figure: lcurve.out comparison of a non-convergent run (Figure 1) and a convergent run (Figure 2)]
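To inspect runs like this without the screenshot, here is a minimal sketch that plots rmse_f_trn from lcurve.out. It assumes the first line of lcurve.out is the usual '#'-prefixed header naming the columns; the exact column set depends on the loss and validation settings in input.json.

    # Sketch: plot the training force RMSE from lcurve.out.
    # Assumption: the first line of lcurve.out is a '#'-prefixed header whose
    # whitespace-separated tokens are the column names (step, rmse_f_trn, lr, ...).
    import numpy as np
    import matplotlib.pyplot as plt

    def load_lcurve(path="lcurve.out"):
        with open(path) as f:
            names = f.readline().lstrip("#").split()
        data = np.loadtxt(path)  # '#' header lines are skipped as comments
        return {name: data[:, i] for i, name in enumerate(names)}

    curves = load_lcurve()
    plt.semilogy(curves["step"], curves["rmse_f_trn"], label="rmse_f_trn")
    plt.xlabel("training step")
    plt.ylabel("rmse_f_trn")
    plt.legend()
    plt.savefig("rmse_f_trn.png")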

DeePMD-kit Version

deepmd-kit-2.2.7

TensorFlow Version

/opt/deepmd-kit-2.2.7/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

It is my input file in 00.train/002:

    {
      "model": {
        "type_map": ["C", "H", "O", "P", "F", "Na"],
        "descriptor": {
          "type": "se_atten_v2",
          "sel": "auto",
          "rcut_smth": 0.5,
          "rcut": 6.0,
          "neuron": [25, 50, 100],
          "resnet_dt": false,
          "axis_neuron": 16,
          "seed": 347445250,
          "attn_layer": 0,
          "attn_mask": false,
          "attn_dotr": true,
          "type_one_side": false,
          "precision": "default",
          "trainable": true,
          "exclude_types": [],
          "set_davg_zero": true,
          "attn": 128,
          "activation_function": "gelu"
        },
        "type_embedding": {
          "neuron": [8],
          "resnet_dt": false,
          "seed": 1374335067,
          "activation_function": "tanh",
          "precision": "default",
          "trainable": true
        },
        "fitting_net": {
          "neuron": [240, 240, 240],
          "resnet_dt": true,
          "seed": 587928124,
          "activation_function": "tanh"
        }
      },
      "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 2000,
        "decay_rate": 0.95
      },
      "loss": {
        "start_pref_e": 0.02,
        "limit_pref_e": 1,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "_start_pref_v": 0,
        "_limit_pref_v": 1
      },
      "training": {
        "set_prefix": "set",
        "stop_batch": 400000,
        "batch_size": [1, 1, 1, 1],
        "seed": 1867518886,
        "disp_file": "lcurve.out",
        "disp_freq": 2000,
        "save_freq": 2000,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json",
        "_comment": "that's all",
        "systems": [
          "../data.init/init/deepmd/0.5M",
          "../data.init/init/deepmd/1.0M",
          "../data.init/init/deepmd/1.5M",
          "../data.init/init/deepmd/2.0M"
        ]
      }
    }

Steps to Reproduce

The attached file is my DP-GEN working directory. Enter the directory 00.train/002 and run the command "dp train input.json" to start the task. The first 10000-50000 steps are enough to tell whether training will converge. Non-convergence can be observed after running the task several times.
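Because the failure is intermittent, a small driver that repeats the run and scans lcurve.out for nan can replace the manual checking. This is a rough sketch assuming it is executed inside 00.train/002 with the input.json shown above; reducing "stop_batch" beforehand (10000-50000 steps are enough, per the report) keeps each run short.

    # Sketch: rerun "dp train input.json" several times and flag runs whose
    # lcurve.out contains nan. Assumes this is executed inside 00.train/002;
    # reduce "stop_batch" in input.json first so each run stays short.
    import shutil
    import subprocess

    N_RUNS = 5  # arbitrary; the report saw 2 convergent runs out of 5

    for i in range(N_RUNS):
        subprocess.run(["dp", "train", "input.json"], check=True)
        with open("lcurve.out") as f:
            has_nan = any("nan" in line.lower() for line in f)
        print(f"run {i}: {'nan detected' if has_nan else 'no nan'}")
        shutil.copy("lcurve.out", f"lcurve.out.run{i}")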

Further Information, Files, and Links

302Na_atten.zip

h840473807 commented 7 months ago

I ran the task on Bohrium. The image is registry.dp.tech/dptech/deepmd-kit:2.2.7-cuda11.6

wanghan-iapcm commented 7 months ago

Please try "set_davg_zero": false
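A quick way to apply this suggestion, assuming the input.json shown above: flip the descriptor's set_davg_zero flag and rewrite the file (a sketch using only the standard library).

    # Sketch: switch "set_davg_zero" to false in the descriptor block of input.json.
    import json

    with open("input.json") as f:
        conf = json.load(f)

    conf["model"]["descriptor"]["set_davg_zero"] = False

    with open("input.json", "w") as f:
        json.dump(conf, f, indent=2)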

njzjz commented 7 months ago

For the same input file (all parameters are same including seeds)

This is the actual problem, which I can reproduce. I also found something interesting: when I run it multiple times, it randomly shows nan in lcurve.out, which needs to be investigated.

First run:

     91      3.25e+01      1.52e-02      1.03e+00    1.0e-03
     92           nan           nan           nan    1.0e-03
     93      3.60e+01      6.42e-02      1.14e+00    1.0e-03

Second run:

     91      3.25e+01      1.52e-02      1.03e+00    1.0e-03
     92      3.66e+01      5.83e-02      1.16e+00    1.0e-03
     93      3.60e+01      6.42e-02      1.14e+00    1.0e-03

Update: some things I found:

  1. The CPU version does not have this issue.
  2. The NN parameters may become nan, but this cannot be stably reproduced (see the sketch after this list for checking a checkpoint).
  3. Training on energy only still has this issue.
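For point 2, here is a minimal sketch for checking whether any variable in the latest saved checkpoint contains nan, assuming TF1-style checkpoints written under the "model.ckpt" prefix configured above and the TensorFlow bundled with deepmd-kit 2.2.7.

    # Sketch: scan the latest TensorFlow checkpoint in the training directory for nan.
    # Assumes checkpoints are written with the "model.ckpt" prefix set by "save_ckpt".
    import numpy as np
    import tensorflow as tf

    ckpt = tf.train.latest_checkpoint(".")  # e.g. "./model.ckpt-400000"
    reader = tf.train.load_checkpoint(ckpt)
    for name in sorted(reader.get_variable_to_shape_map()):
        tensor = np.asarray(reader.get_tensor(name))
        if np.issubdtype(tensor.dtype, np.floating) and np.isnan(tensor).any():
            print("nan found in variable:", name)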