[BUG] PBC inference very slow in v2.0

deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics

https://docs.deepmodeling.com/projects/deepmd/

GNU Lesser General Public License v3.0

[BUG] PBC inference very slow in v2.0 #713

Closed: njzjz closed this issue 3 years ago

njzjz commented 3 years ago

Summary

Prediction with PBC on my system is very slow in both v2.0.0.b0 and v2.0.0.b1, for both Python and C++. Non-PBC prediction works well. A model converted from v1.3.3 and a model frozen in v2.0.0.b1 both show the behavior summarized below.

DeePMD-kit version	PBC	non-PBC
v1.3.3	Fast	Fast
v2.0.0.b1	Very slow	Fast

Deepmd-kit version, installation way, input file, running commands, error log, etc.

v2.0.0.b1, installed via conda (GPU build, CUDA 10.1)

Steps to Reproduce

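import dpdata
# Load the system from the deepmd/raw files in the "input" directory
s1 = dpdata.System("input", fmt="deepmd/raw")
# Run inference with the frozen model g2.pb
s1p = s1.predict("g2.pb")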

The dpdata version used is the one from https://github.com/deepmodeling/dpdata/pull/162

The model input is the same as in #658.

Further Information, Files, and Links

amcadmus commented 3 years ago

How large is the PBC system?

njzjz commented 3 years ago

Here is my system: input.zip

My model:

v2.0 version: g2.zip
v1.3 version: g2o.zip

amcadmus commented 3 years ago

I find nothing in the input.zip.

njzjz commented 3 years ago

input.zip

iProzd commented 3 years ago

There is a corrupted data point in the input coordinates that contains a very large number:

s1.data["coords"][0][274]
array([1.88329067e+01, 1.10405912e+17, 1.80823460e+01])

v2.0 uses a while loop to normalize each coordinate, subtracting one box length (in box-scaled units) per iteration, so with a value this large it gets stuck on both CPU and GPU. (v1.x subtracted one box length only once, which is not the intended behavior but let the calculation pass.)
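
To illustrate why the loop stalls, here is a minimal Python sketch of the loop-style wrap. The real code is C++; `wrap_by_loop` and `max_iter` are names made up for this illustration, and the cap exists only so the demo cannot hang. The iteration count grows linearly with the scaled coordinate, and at 1e17 (above 2^53) subtracting 1.0 from a double no longer changes it, so the loop makes no progress at all. The scaled value for the reported system depends on its box size, which is not given in this thread, so the numbers are illustrative only.

```python
def wrap_by_loop(r, max_iter=1_000_000):
    """Wrap a box-scaled coordinate one box length at a time.

    Returns (wrapped_value, iterations). max_iter is a hypothetical
    safety cap added for this sketch; the original loop has no cap.
    """
    iterations = 0
    while r >= 1.0:
        r -= 1.0
        iterations += 1
        if iterations > max_iter:
            raise RuntimeError("wrap did not finish within max_iter steps")
    while r < 0.0:
        r += 1.0
        iterations += 1
        if iterations > max_iter:
            raise RuntimeError("wrap did not finish within max_iter steps")
    return r, iterations


print(wrap_by_loop(2.25))   # small value: two subtractions
print(wrap_by_loop(-0.25))  # slightly negative value: one addition

# At 1e17 adjacent doubles are more than 1 apart, so subtracting 1.0
# gives back the same number and the loop can never terminate.
print(1e17 - 1.0 == 1e17)
```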

A simple solution would be to replace the while loops:

while (ri[dd] >= 1.) ri[dd] -= 1.;
while (ri[dd] < 0.) ri[dd] += 1.;

with:

ri[dd] = ri[dd] - (long long int)ri[dd];
if (ri[dd] < 0.) ri[dd] += 1.;

Alternatively, we could add an assert that exits when a very large number is encountered.
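
The assert option could look roughly like the sketch below, written in Python for brevity. `check_scaled_coord` and the `max_abs` threshold are hypothetical and not part of DeePMD-kit; the default only reflects the expectation that box-scaled coordinates stay within a few box lengths.

```python
import math


def check_scaled_coord(r, max_abs=1.0e3):
    """Return r unchanged, or raise ValueError if it looks corrupted.

    max_abs is an assumed threshold chosen for this sketch.
    """
    if not math.isfinite(r):
        raise ValueError(f"non-finite scaled coordinate: {r!r}")
    if abs(r) > max_abs:
        raise ValueError(f"scaled coordinate {r!r} exceeds {max_abs!r} box lengths")
    return r


print(check_scaled_coord(1.5))  # a plausible value passes through
try:
    check_scaled_coord(5.0e15)  # an implausibly large value is rejected
except ValueError as err:
    print("rejected:", err)
```

Failing early like this turns a silent hang into an error that points at the bad value.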

njzjz commented 3 years ago

I just realized that I dropped a decimal point when copying the numbers! (So 1.10405912e+17 should have been 1.10405912e+01.)

njzjz commented 3 years ago

Although my input was incorrect, why not make the current program robust against such values with something like

ri[dd] = fmod(ri[dd], 1.);
if (ri[dd] < 0.) ri[dd] += 1.;

See https://www.cplusplus.com/reference/cmath/fmod/ and https://developer.download.nvidia.com/cg/fmod.html
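
As a rough comparison of the two constant-time proposals, the Python sketch below mirrors both. `wrap_by_fmod` and `wrap_by_trunc` are names invented here. `math.fmod` follows the C `fmod` convention (the result keeps the sign of the first argument), and `int()` truncates toward zero like the C cast, so the arithmetic matches the C++ snippets for values that fit in a `long long`.

```python
import math


def wrap_by_fmod(r):
    """Wrap a box-scaled coordinate using fmod, then shift negatives up by 1."""
    r = math.fmod(r, 1.0)  # result has the same sign as r
    if r < 0.0:
        r += 1.0
    return r


def wrap_by_trunc(r):
    """Wrap by subtracting the integer part (truncation toward zero)."""
    r = r - int(r)
    if r < 0.0:
        r += 1.0
    return r


# Both variants finish in constant time, even for the huge value.
for r in (2.25, -0.25, 1.0e15 + 0.25, 1.0e17):
    print(r, wrap_by_fmod(r), wrap_by_trunc(r))
```

One difference the sketch cannot show: in C++, converting a double to `long long` is undefined behavior when the value does not fit in that type, whereas `fmod` is defined for any finite input. Python's `int()` has no such limit.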

iProzd commented 3 years ago

We had assumed that box-scaled coordinates would always be small (around 1-2 or less), so the while loops seemed sufficient. This issue reminds us to use something like fmod, as you suggested, in case of unexpected numbers.