deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

an illegal memory access was encountered when fix npt is used together with a compute command in lammps #472

Closed zhangyongsdu closed 3 years ago

zhangyongsdu commented 3 years ago

In recent versions of deepmd-kit (from dp1.3.1 to the most recent master version), there may be a bug in the GPU version that causes an illegal memory access. It only occurs when a compute that calculates potential energy (or any compute that relies on potential energy, such as stress) is invoked together with a fix npt command. An example is given below:

    fix 2 mobile npt temp ${initTemp} ${initTemp} 0.1 x 1 1 2 y 1 1 2 z 1 1 2
    compute potential all pe/atom
    thermo 100
    thermo_style custom step pe ke temp pxx pyy pzz pxy pyz pxz vol
    dump 2 all custom 1000 ${file}-equal id type x y z c_potential

I guess this is because both the compute and the fix npt commands call the GPU to calculate the potential energy from the deep neural network potential. All available memory is presumably allocated for the fix npt command, so the access to GPU memory from the compute command is treated as illegal.

The full error message is:

    2021-04-05 12:21:07.174540: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
    2021-04-05 12:21:07.174566: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
    [gadi-gpu-v100-0128:925339] Process received signal
    [gadi-gpu-v100-0128:925339] Signal: Aborted (6)
    [gadi-gpu-v100-0128:925339] Signal code: (-6)
    [gadi-gpu-v100-0128:925339] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x14a2fec2cb20]
    [gadi-gpu-v100-0128:925339] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x14a2fe88e7ff]
    [gadi-gpu-v100-0128:925339] [ 2] /lib64/libc.so.6(abort+0x127)[0x14a2fe878c35]
    [gadi-gpu-v100-0128:925339] [ 3] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_cc.so.2(+0xc94a2b7)[0x14a30e4142b7]
    [gadi-gpu-v100-0128:925339] [ 4] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_cc.so.2(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl14lts_2020_02_2513InlinedVectorINS0_5InUseELm4ESaIS4_EEE+0x161)[0x14a30de4bf81]
    [gadi-gpu-v100-0128:925339] [ 5] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_cc.so.2(_ZN10tensorflow8EventMgr8PollLoopEv+0xa4)[0x14a30de4c374]
    [gadi-gpu-v100-0128:925339] [ 6] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x4b1)[0x14a300d9bb71]
    [gadi-gpu-v100-0128:925339] [ 7] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_framework.so.2(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x43)[0x14a300d99263]
    [gadi-gpu-v100-0128:925339] [ 8] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_framework.so.2(+0x1103547)[0x14a300d8a547]
    [gadi-gpu-v100-0128:925339] [ 9] /lib64/libpthread.so.0(+0x814a)[0x14a2fec2214a]
    [gadi-gpu-v100-0128:925339] [10] /lib64/libc.so.6(clone+0x43)[0x14a2fe953f23]
    [gadi-gpu-v100-0128:925339] End of error message

amcadmus commented 3 years ago

@zhangyongsdu Has this issue been solved? If so, could you please share how you fixed it?

zhangyongsdu commented 3 years ago

@amcadmus Sorry, I don't know what causes the error.

zhangyongsdu commented 3 years ago

The latest development version (after 10 May 2021) fixes the issue.

njzjz commented 3 years ago

I think it has been fixed by #391 and #392. The fix is only available in v1.3.3 or in versions after v2.0.0.a0, so it will not work in v1.3.1.
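For anyone checking whether an installed build is affected, the version rule stated above can be written as a small check. This is only a sketch under assumptions: the function names `parse_version` and `likely_has_fix` are hypothetical and not part of deepmd-kit, the 1.x branch is assumed to carry the fix in every release from v1.3.3 onward, and v2.0.0.a0 itself is treated as not yet fixed.

```python
import re


def parse_version(text):
    """Split a string such as 'v1.3.3' or '2.0.0.a0' into ((major, minor, patch), suffix)."""
    cleaned = text.strip().lstrip("v")
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)(.*)$", cleaned)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch)), suffix


def likely_has_fix(version):
    """Return True if the version is expected to contain the fix, per the rule in this thread.

    Assumption: 1.x releases >= 1.3.3 are fixed; 2.0.0.a0 is not; anything later is.
    """
    release, suffix = parse_version(version)
    if release < (2, 0, 0):
        return release >= (1, 3, 3)
    if release == (2, 0, 0):
        # Only the a0 pre-release is treated as predating the fix (an assumption).
        return suffix.lstrip(".").lower() != "a0"
    return True


if __name__ == "__main__":
    for candidate in ("v1.3.1", "v1.3.3", "v2.0.0.a0", "v2.0.0"):
        print(candidate, likely_has_fix(candidate))
```

Builds made from the development branch are not covered by this check; for those, the relevant question is whether the source was updated after 10 May 2021, as noted earlier in the thread.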