deepmodeling / dpgen

The deep potential generator to generate a deep-learning based model of interatomic potential energy and force field
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

[BUG] Interstitial Autotest fails with DeePMD-kit 2.1.5 and LAMMPS 2022 #1051

Open AnguseZhang opened 1 year ago

AnguseZhang commented 1 year ago

Summary

Running dpgen autotest post property.json to calculate interstitial properties can fail with the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.7/site-packages/dpgen/main.py", line 185, in main
    args.func(args)
  File "/opt/anaconda3/lib/python3.7/site-packages/dpgen/auto_test/run.py", line 57, in gen_test
    run_task(args.TASK, args.PARAM, args.MACHINE)
  File "/opt/anaconda3/lib/python3.7/site-packages/dpgen/auto_test/run.py", line 48, in run_task
    post_property(confs, property_list)
  File "/opt/anaconda3/lib/python3.7/site-packages/dpgen/auto_test/common_prop.py", line 249, in post_property
    path_to_work)
  File "/opt/anaconda3/lib/python3.7/site-packages/dpgen/auto_test/Property.py", line 101, in compute
    res = task.compute(ii)
  File "/opt/anaconda3/lib/python3.7/site-packages/dpgen/auto_test/Lammps.py", line 369, in compute
    d_dump = loadfn(contcar)
  File "/opt/anaconda3/lib/python3.7/site-packages/monty/serialization.py", line 88, in loadfn
    return json.load(fp, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/opt/anaconda3/lib/python3.7/json/__init__.py", line 361, in loads
    return cls(**kw).decode(s)
  File "/opt/anaconda3/lib/python3.7/site-packages/monty/json.py", line 368, in decode
    return self.process_decoded(d)
  File "/opt/anaconda3/lib/python3.7/site-packages/monty/json.py", line 340, in process_decoded
    return cls_.from_dict(data)
  File "/opt/anaconda3/lib/python3.7/site-packages/monty/json.py", line 175, in from_dict
    return cls(**decoded)
  File "/opt/anaconda3/lib/python3.7/site-packages/dpdata-0.2.9.dev0+g0db9246.d20220828-py3.7.egg/dpdata/system.py", line 223, in __init__
    self.check_data()
  File "/opt/anaconda3/lib/python3.7/site-packages/dpdata-0.2.9.dev0+g0db9246.d20220828-py3.7.egg/dpdata/system.py", line 243, in check_data
    dd.check(self)
  File "/opt/anaconda3/lib/python3.7/site-packages/dpdata-0.2.9.dev0+g0db9246.d20220828-py3.7.egg/dpdata/system.py", line 122, in check
    data.shape, shape))
dpdata.system.DataError: Shape of energies is (14,), but expected (30,)

After some effort, I located the problem. In dp_test_Al_autotest/confs/mp-134/interstitial_00/task.000001, DP-GEN launches a LAMMPS task that runs three successive minimizations with DeePMD to relax the interstitial structure. The relevant LAMMPS settings are:

pair_style deepmd frozen_model.pb
pair_coeff * *
compute         mype all pe
thermo          100
thermo_style    custom step pe pxx pyy pzz pxy pxz pyz lx ly lz vol c_mype
dump            1 all custom 100 dump.relax id type xs ys zs fx fy fz
min_style       cg
fix             1 all box/relax iso 0.0
minimize        1.000000e-12 1.000000e-06 5000 500000
fix             1 all box/relax aniso 0.0
minimize        1.000000e-12 1.000000e-06 5000 500000
fix             1 all box/relax tri 0.0
minimize        1.000000e-12 1.000000e-06 5000 500000

Note that structures are dumped every 100 steps and the potential energy is printed every 100 steps. The problem is that in LAMMPS 23 Jun 2022 - Update 1, dump and thermo no longer write their output at the same steps when several minimizations are run in succession.

The log file for the third minimization is:

Per MPI rank memory allocation (min/avg/max) = 5.52 | 5.52 | 5.52 Mbytes
   Step         PotEng          Pxx            Pyy            Pzz            Pxy            Pxz            Pyz             Lx             Ly             Lz           Volume         c_mype
      1017  -101.90358     -772.61469      771.65716     -0.36175532    -1337.3948      1891.3693      1091.8029      8.771989       7.5967654      7.162299       477.2866      -101.90358
      1100  -101.90762     -34.807043      20.106812     -20.3897        1.3568185      41.463714     -7.8191385      8.7603976      7.6077917      7.1629873      477.39362     -101.90762
      1200  -101.90818     -39.620708     -21.928825     -18.883387      27.079249     -10.078848      6.308225       8.7602294      7.5940946      7.1764093      477.41788     -101.90818
      1300  -101.91165     -103.61287     -117.65829     -83.663539      28.226585      27.846368      12.713908      8.7602825      7.5706573      7.199864       477.50288     -101.91165
      1400  -101.91818     -99.215486     -123.04487     -67.82036       39.668083      42.864046      18.156382      8.7602265      7.5478631      7.2229665      477.5897      -101.91818
      1500  -101.92792     -96.661198     -128.72016     -54.5928        49.094454      57.910921      24.387902      8.7601492      7.5253981      7.2465908      477.72143     -101.92792
      1600  -101.94216     -91.922692     -130.77557     -39.930865      56.278056      73.16554       31.433556      8.7600214      7.5015185      7.2727058      477.91469     -101.94216
      1700  -101.96203     -87.492487     -130.75189     -27.319727      60.323659      87.493159      39.08075       8.7598244      7.4762116      7.3016725      478.18872     -101.96203
      1800  -101.9878      -85.647748     -129.82396     -20.663677      60.188314      98.408569      46.345628      8.7595231      7.4504513      7.3327726      478.55434     -101.9878
      1900  -102.01816     -87.973408     -129.23356     -22.803268      55.734739      103.19599      51.741111      8.7590678      7.4258866      7.3643159      479.00341     -102.01816
      2000  -102.04999     -95.021719     -130.46387     -34.283235      48.257491      100.36043      53.885498      8.7584069      7.4042476      7.3941194      479.50431     -102.04999
      2100  -102.07949     -106.25431     -134.83933     -53.073211      40.008772      90.769909      52.357739      8.7575157      7.386654       7.4203245      480.01143     -102.07949
      2200  -102.10399     -122.00801     -144.46013     -77.429046      33.078148      76.944583      47.882339      8.7564011      7.3732235      7.4421447      480.48646     -102.10399
      2300  -102.12276     -140.47393     -158.42258     -104.09328      28.371295      61.385216      41.492416      8.7550646      7.3633464      7.4598188      480.90896     -102.12276
      2400  -102.13657     -158.44886     -173.58626     -129.42062      25.735063      45.46496       33.841815      8.7534414      7.3560835      7.4743342      481.28019     -102.13657
      2500  -102.14665     -159.48911     -172.30748     -138.39812      22.88323       28.300543      23.829957      8.751197       7.3504268      7.4871837      481.61334     -102.14665
      2600  -102.15202     -13.41267       109.82945      19.704164      21.189441      27.882556     -19.488702      8.7482641      7.3450543      7.4977028      481.77595     -102.15202
      2700  -102.15255     -0.26741686     1.7729746      0.20781206     0.39057458     0.22548034    -0.37762152     8.7464012      7.3445786      7.5012581      481.87055     -102.15255
      2705  -102.15255     -0.31011589    -0.22210737    -0.038315831    0.32848199    -0.29074246     0.04699747     8.7463928      7.3445916      7.501259       481.871       -102.15255
Loop time of 4.03272 on 1 procs for 1688 steps with 28 atoms

110.0% CPU use with 1 MPI tasks x 1 OpenMP threads

As we can see, since the thermo frequency is 100, energy information is printed at steps 1100, 1200, 1300, etc.

However, when we inspect dump.relax, which stores the structure information, with grep -A 1 "TIMESTEP" dump.relax, we see that frames are written at steps 1017, 1117, 1217, etc. This mismatch causes the error in dpgen autotest.

ITEM: TIMESTEP
1017
--
ITEM: TIMESTEP
1117
--
ITEM: TIMESTEP
1217
--
ITEM: TIMESTEP
1317
--
ITEM: TIMESTEP
1417
--
ITEM: TIMESTEP
1517
--
ITEM: TIMESTEP
1617
--
ITEM: TIMESTEP
1717
--
ITEM: TIMESTEP
1817
--
ITEM: TIMESTEP
1917
--
ITEM: TIMESTEP
2017
--
ITEM: TIMESTEP
2117
--
ITEM: TIMESTEP
2217
--
ITEM: TIMESTEP
2317
--
ITEM: TIMESTEP
2417
--
ITEM: TIMESTEP
2517
--
ITEM: TIMESTEP
2617
--
ITEM: TIMESTEP
2705
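The two output schedules seen in the log and the dump can be modelled in a few lines of Python. This is only a sketch that reproduces the observed step numbers, not LAMMPS's actual scheduling logic; the function names are made up for illustration, and the start step (1017), end step (2705) and interval (100) are taken from the third minimization above.

```python
def steps_on_multiples(start, end, every):
    """Output at the first and last step plus every multiple of `every`
    in between (matches the thermo output in the log above)."""
    first_multiple = -(-start // every) * every  # ceil(start / every) * every
    steps = {start, end}
    steps.update(range(first_multiple, end + 1, every))
    return sorted(steps)


def steps_from_run_start(start, end, every):
    """Output every `every` steps counted from the first step of the run,
    plus the last step (matches dump.relax above)."""
    steps = {end}
    steps.update(range(start, end + 1, every))
    return sorted(steps)


thermo = steps_on_multiples(1017, 2705, 100)
dump = steps_from_run_start(1017, 2705, 100)

print(len(thermo), thermo[:3])  # 19 [1017, 1100, 1200]
print(len(dump), dump[:3])      # 18 [1017, 1117, 1217]
print(sorted(set(thermo) & set(dump)))  # steps present in both outputs
```

Under this model only the first and last steps of the run appear in both outputs, so the frames in dump.relax cannot be paired one-to-one with the energies in the log.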

This behavior depends on the LAMMPS version. I then tried DeePMD-kit 2.1.1 with LAMMPS (29 Sep 2021 - Update 3) and ran the same minimization. The dump file looks like this:

ITEM: TIMESTEP
1111
--
ITEM: TIMESTEP
1200
--
ITEM: TIMESTEP
1300
--
ITEM: TIMESTEP
1400
--
ITEM: TIMESTEP
1500
--
ITEM: TIMESTEP
1600
--
ITEM: TIMESTEP
1700
--
ITEM: TIMESTEP
1800
--
ITEM: TIMESTEP
1900
--
ITEM: TIMESTEP
2000
--
ITEM: TIMESTEP
2100
--
ITEM: TIMESTEP
2200
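To check which behavior a given LAMMPS build shows, the grep comparison above could be scripted. The following is a minimal sketch, assuming the dump file uses the usual ITEM: TIMESTEP headers and the log contains a thermo table that starts with a Step header line and ends at the Loop time line, as in the excerpts in this issue. The helper names and the shortened sample strings are hypothetical.

```python
def dump_timesteps(dump_text):
    """Return the integer that follows each 'ITEM: TIMESTEP' header."""
    lines = dump_text.splitlines()
    return [
        int(lines[i + 1])
        for i, line in enumerate(lines[:-1])
        if line.strip() == "ITEM: TIMESTEP"
    ]


def thermo_steps(log_text):
    """Return the Step column of every row inside a thermo table."""
    steps, in_table = [], False
    for line in log_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Step":
            in_table = True          # header row of a thermo table
        elif line.startswith("Loop time"):
            in_table = False         # end of the table
        elif in_table and fields and fields[0].isdigit():
            steps.append(int(fields[0]))
    return steps


# Shortened, illustrative samples in the shape of the files quoted above.
sample_dump = """ITEM: TIMESTEP
1017
ITEM: NUMBER OF ATOMS
28
ITEM: TIMESTEP
1117
ITEM: NUMBER OF ATOMS
28
"""
sample_log = """   Step         PotEng
      1017  -101.90358
      1100  -101.90762
Loop time of 4.03272 on 1 procs for 1688 steps with 28 atoms
"""

unmatched = sorted(set(dump_timesteps(sample_dump)) - set(thermo_steps(sample_log)))
print("dump frames without a thermo record:", unmatched)
```

An empty list would mean every dumped frame has a matching thermo record; any step listed is a frame for which no energy was printed.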

DPGEN Version and Platform

DP-GEN 0.10.6, DeePMD-kit 2.1.5-cuda11, LAMMPS 23 Jun 2022 - Update 1.

Job submission and computing cluster configuration

DP-GEN runs on MacOS and Linux.

Expected Behavior

This has been described.

Actual Behavior

This has been described.

Steps to Reproduce

About DP-GEN autotest:

unzip dp_test_Al_autotest.zip; cd dp_test_Al_autotest; dpgen autotest post property.json;

About LAMMPS:

unzip dp_test_Al_autotest.zip; cd dp_test_Al_autotest/confs/mp-134/interstitial_00/task.000001; lmp -i in.lammps;

Further Information, Files, and Links

dp_test_Al_autotest.zip

njzjz commented 1 year ago

reset_timestep may help.

ZLI-afk commented 1 year ago

This doesn't seem to be the autotest's problem.

ZLI-afk commented 1 year ago

I didn't encounter this bug with the CPU version of DeePMD-kit 2.1.5 and LAMMPS 20220623. Maybe something is wrong with the GPU version of the new DeePMD-kit?

DM0815 commented 1 year ago

Hi Yuzhi, I ran into the same problem. Have you solved it? Could you give me some hints?

njzjz commented 1 year ago

@AnguseZhang I think it will be helpful to add reset_timestep 0 after each minimize command. Please help check it.
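For anyone who wants to try this suggestion on an existing in.lammps, here is a minimal sketch that inserts reset_timestep 0 after every minimize line. The helper name add_reset_timestep is hypothetical and not part of DP-GEN, and whether the change actually removes the mismatch has not been confirmed in this thread or tested against LAMMPS here.

```python
def add_reset_timestep(input_text, value=0):
    """Return the LAMMPS input with 'reset_timestep <value>' inserted
    after each line whose first word is 'minimize'."""
    out = []
    for line in input_text.splitlines():
        out.append(line)
        fields = line.split()
        if fields and fields[0] == "minimize":
            out.append(f"reset_timestep  {value}")
    return "\n".join(out) + "\n"


# Excerpt of the relaxation settings quoted in this issue.
original = """min_style       cg
fix             1 all box/relax iso 0.0
minimize        1.000000e-12 1.000000e-06 5000 500000
fix             1 all box/relax aniso 0.0
minimize        1.000000e-12 1.000000e-06 5000 500000
fix             1 all box/relax tri 0.0
minimize        1.000000e-12 1.000000e-06 5000 500000
"""

patched = add_reset_timestep(original)
print(patched)
```

Note that resetting the step counter means the step numbers in the log and in dump.relax start again from 0 for each minimization, so any post-processing that relies on globally increasing step numbers would have to account for that.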