[BUG] LAMMPS results are abnormal in release 2.0.0

gzq942560379 commented 3 years ago

I trained a model on the water data in the examples folder. When I use this model for LAMMPS MD inference, the results are abnormal. The problem did not occur in the previous beta version: the same model gives different results in the release version and in the previous beta version. Below are my training input file, dp test output, and LAMMPS output.

training input.json

{
    "_comment": " model parameters",
    "model": {
        "type_map": ["O", "H"],
        "descriptor": {
            "type": "se_e2_a",
            "sel": [46, 92],
            "rcut_smth": 0.50,
            "rcut": 6.00,
            "neuron": [32, 64, 128],
            "resnet_dt": false,
            "axis_neuron": 16,
            "seed": 1,
            "_comment": " that's all"
        },
        "fitting_net": {
            "neuron": [240, 240, 240],
            "resnet_dt": true,
            "seed": 1,
            "_comment": " that's all"
        },
        "_comment": " that's all"
    },

    "learning_rate": {
        "type": "exp",
        "decay_steps": 5000,
        "start_lr": 0.001,
        "stop_lr": 3.51e-8,
        "_comment": "that's all"
    },

    "loss": {
        "type": "ener",
        "start_pref_e": 0.02,
        "limit_pref_e": 1,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0,
        "limit_pref_v": 0,
        "_comment": " that's all"
    },

    "training": {
        "training_data": {
            "systems": ["../data/data_0/", "../data/data_1/", "../data/data_2/"],
            "batch_size": "auto",
            "_comment": "that's all"
        },
        "validation_data": {
            "systems": ["../data/data_3"],
            "batch_size": 1,
            "numb_btch": 3,
            "_comment": "that's all"
        },
        "numb_steps": 100000,
        "seed": 10,
        "disp_file": "lcurve.out",
        "disp_freq": 100,
        "save_freq": 1000,
        "_comment": "that's all"
    },

    "_comment": "that's all"
}

test result

DEEPMD INFO    # number of test data : 1 
DEEPMD INFO    Energy RMSE        : 7.472860e-02 eV
DEEPMD INFO    Energy RMSE/Natoms : 3.892115e-04 eV
DEEPMD INFO    Force  RMSE        : 5.845049e-02 eV/A
DEEPMD INFO    Virial RMSE        : 6.561497e+00 eV
DEEPMD INFO    Virial RMSE/Natoms : 3.417446e-02 eV

lammps result

LAMMPS (30 Jul 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
# bulk water

units           metal
boundary        p p p
atom_style      atomic

neighbor        2.0 bin
neigh_modify    every 50 delay 0 check no

read_data           ../lmp/water.lmp
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.001 seconds
mass                1 16
mass                2 2

replicate       1 1 1
Replicating atoms ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  192 atoms
  replicate CPU = 0.001 seconds

# load the plugin at <install_prefix>/lib/libdeepmd_lmp.so
plugin load     ../../../dp/lib/libdeepmd_lmp.so
2021-09-06 05:07:15.880999: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Loading plugin: deepmd pair style v2.0 by Han Wang
Loading plugin: compute deeptensor/atom v2.0 by Han Wang
Loading plugin: fix dplr v2.0 by Han Wang

pair_style          deepmd ../../../model/water/graph-original.pb
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /home/user/deepmd-kit/dp
  source:             v2.0.0
  source branch:       HEAD
  source commit:      1a25414
  source commit at:   2021-08-28 08:15:38 +0800
  surpport model ver.:1.0 
  build float prec:   double
  build with tf inc:  /home/user/software/tensorflow-gpu-2.4/include;/home/user/software/tensorflow-gpu-2.4/include
  build with tf lib:  /home/user/software/tensorflow-gpu-2.4/lib/libtensorflow_cc.so;/home/user/software/tensorflow-gpu-2.4/lib/libtensorflow_framework.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
  use deepmd-kit at:  /home/user/deepmd-kit/dp
  source:             v2.0.0
  source branch:      HEAD
  source commit:      1a25414
  source commit at:   2021-08-28 08:15:38 +0800
  build float prec:   double
  build with tf inc:  /home/user/software/tensorflow-gpu-2.4/include;/home/user/software/tensorflow-gpu-2.4/include
  build with tf lib:  /home/user/software/tensorflow-gpu-2.4/lib/libtensorflow_cc.so;/home/user/software/tensorflow-gpu-2.4/lib/libtensorflow_framework.so
2021-09-06 05:07:15.944785: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-06 05:07:16.074100: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-06 05:07:16.083678: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-06 05:07:16.083750: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-09-06 05:07:16.086385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:3b:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-09-06 05:07:16.086427: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-06 05:07:16.093307: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-09-06 05:07:16.093349: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-09-06 05:07:16.094544: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-09-06 05:07:16.094890: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-09-06 05:07:16.095777: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2021-09-06 05:07:16.096861: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-09-06 05:07:16.097030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-09-06 05:07:16.101095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-09-06 05:07:17.603292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-06 05:07:17.603340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-09-06 05:07:17.603350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-09-06 05:07:17.609755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 36482 MB memory) -> physical GPU (device: 0, name: A100-PCIE-40GB, pci bus id: 0000:3b:00.0, compute capability: 8.0)
2021-09-06 05:07:17.661374: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3000000000 Hz
  >>> Info of model(s):
  using   1 model(s): ../../../model/water/graph-original.pb 
  rcut in model:      6
  ntypes in model:    2
pair_coeff      * *

velocity        all create 330.0 23456789

fix             1 all nvt temp 330.0 330.0 0.5
timestep        0.0005
thermo_style    custom step pe ke etotal temp press vol
thermo          20
# dump              1 all custom 100 water.dump id type x y z

run             99

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- USER-DEEPMD package:
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Neighbor list info ...
  update every 50 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
      attributes: , newton on
      pair build: full/bin/atomonly
      stencil: half/bin/3d/tri
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
2021-09-06 05:07:18.718134: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-09-06 05:07:21.396912: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
Per MPI rank memory allocation (min/avg/max) = 2.650 | 2.650 | 2.650 Mbytes
Step PotEng KinEng TotEng Temp Press Volume 
       0   -29915.087    8.1472669    -29906.94          330    3350.9587    1927.3176 
      20   -29917.854    10.912417   -29906.942    442.00069   -10514.011    1927.3176 
      40   -29921.478    14.535179   -29906.943    588.73844   -6586.5916    1927.3176 
      60   -29896.586    22.764435   -29873.822    922.05934    5369.5048    1927.3176 
      80   -29903.143    29.254954   -29873.889    1184.9538     5065.801    1927.3176 
      99   -29907.677     33.63073   -29874.046    1362.1919    6240.4274    1927.3176 
Loop time of 0.893113 on 1 procs for 99 steps with 192 atoms

Performance: 4.789 ns/day, 5.012 hours/ns, 110.848 timesteps/s
96.5% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.88632    | 0.88632    | 0.88632    |   0.0 | 99.24
Neigh   | 0.0024072  | 0.0024072  | 0.0024072  |   0.0 |  0.27
Comm    | 0.0020397  | 0.0020397  | 0.0020397  |   0.0 |  0.23
Output  | 0.00038211 | 0.00038211 | 0.00038211 |   0.0 |  0.04
Modify  | 0.0014147  | 0.0014147  | 0.0014147  |   0.0 |  0.16
Other   |            | 0.0005449  |            |       |  0.06

Nlocal:        192.000 ave         192 max         192 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:        2094.00 ave        2094 max        2094 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:         0.00000 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0.0000000
Neighbor list builds = 1
Dangerous builds not checked
Total wall time: 0:00:06
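
The abnormality is visible in the thermo table above: the NVT thermostat targets 330 K, yet the reported temperature climbs to about 1362 K by step 99. A minimal sketch of how one might flag this automatically is shown here. The thermo rows are copied from the log above; the function names and the 50% tolerance are hypothetical choices for illustration, not part of DeePMD-kit or LAMMPS.

```python
# Thermo rows copied verbatim from the LAMMPS log in this report.
THERMO_LOG = """\
Step PotEng KinEng TotEng Temp Press Volume
       0   -29915.087    8.1472669    -29906.94          330    3350.9587    1927.3176
      20   -29917.854    10.912417   -29906.942    442.00069   -10514.011    1927.3176
      40   -29921.478    14.535179   -29906.943    588.73844   -6586.5916    1927.3176
      60   -29896.586    22.764435   -29873.822    922.05934    5369.5048    1927.3176
      80   -29903.143    29.254954   -29873.889    1184.9538     5065.801    1927.3176
      99   -29907.677     33.63073   -29874.046    1362.1919    6240.4274    1927.3176
"""


def parse_thermo(text):
    """Turn a thermo table (header line + numeric rows) into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(float, row))) for row in rows]


def max_temp_overshoot(rows, target):
    """Largest relative deviation of Temp from the thermostat target."""
    return max(abs(row["Temp"] - target) / target for row in rows)


rows = parse_thermo(THERMO_LOG)
overshoot = max_temp_overshoot(rows, target=330.0)

# Hypothetical tolerance: flag the run if Temp strays more than 50% from target.
TOLERANCE = 0.5
is_abnormal = overshoot > TOLERANCE
print(f"max relative temperature deviation: {overshoot:.2f}, abnormal: {is_abnormal}")
```

A check like this only detects the symptom. It does not say whether the model, the plugin, or the LAMMPS build is at fault.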

njzjz commented 3 years ago

Reproduced. Using the same model, LAMMPS 29Oct2020 works fine, but 30Jul2021 gives wrong results. Something may be incompatible with the newer LAMMPS versions.
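
One way to make such a cross-version comparison quantitative is to evaluate the same configuration with both builds and compare the per-atom forces. The sketch below assumes the forces have already been extracted into lists of (fx, fy, fz) tuples. The function name and the sample numbers are made-up placeholders for illustration; they are not values from the runs in this thread.

```python
import math


def max_force_deviation(forces_ref, forces_new):
    """Largest per-atom Euclidean distance between two force sets (same atom order)."""
    if len(forces_ref) != len(forces_new):
        raise ValueError("force arrays must describe the same number of atoms")
    return max(math.dist(a, b) for a, b in zip(forces_ref, forces_new))


# Placeholder forces in eV/A for two atoms; NOT taken from the actual simulations.
forces_reference_build = [(0.1, -0.2, 0.3), (0.0, 0.5, -0.1)]
forces_suspect_build = [(0.1, -0.2, 0.3), (0.0, 0.5, 0.9)]

deviation = max_force_deviation(forces_reference_build, forces_suspect_build)
print(f"max per-atom force deviation: {deviation:.3f} eV/A")
```

With a deterministic model and identical input, the deviation between two correct builds should be near floating-point noise, so any sizeable value points at the build rather than the model.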

njzjz commented 3 years ago

I am going to test multiple LAMMPS versions to find the problem.

[x] 8Apr2021 ok!
[x] 14May2021 forces are wrong
[ ] 27May2021
[ ] 2Jul2021
[ ] 28Jul2021
[x] 30Jul2021 forces are wrong
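
The search over versions in the checklist above is a bisection: given one known-good and one known-bad release, test the midpoint and halve the interval. A minimal sketch follows, assuming the regression appears once and persists in every later version. The `first_bad` helper and the stand-in predicate are hypothetical; in practice the predicate would build that LAMMPS version and check the forces.

```python
# Release tags from the checklist above, oldest first.
VERSIONS = ["8Apr2021", "14May2021", "27May2021", "2Jul2021", "28Jul2021", "30Jul2021"]


def first_bad(versions, is_good):
    """Return the earliest bad version.

    Assumes versions[0] is good, versions[-1] is bad, and that once a
    version is bad, all later versions are bad as well.
    """
    lo, hi = 0, len(versions) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]


def stand_in_is_good(version):
    """Stand-in for a real build-and-test step.

    Consistent with the results reported so far (8Apr2021 ok, 14May2021 and
    30Jul2021 wrong) and ASSUMES the untested versions in between are also wrong.
    """
    return VERSIONS.index(version) < VERSIONS.index("14May2021")


culprit = first_bad(VERSIONS, stand_in_is_good)
print(f"first bad version: {culprit}")
```

The same helper works on an ordered list of commits between two releases, which is the natural next step once the release-level interval cannot be narrowed further.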

njzjz commented 3 years ago

Something that changed between LAMMPS 8Apr2021 and 14May2021 must have broken the behavior, but I haven't found it yet. cc @amcadmus

njzjz commented 3 years ago

Fixed by #1128