Closed zhangyongsdu closed 3 years ago
@zhangyongsdu Is this issue solved? Could you please share how you fixed it?
@amcadmus Sorry, I don't know what causes the error.
The latest development version (after 10 May 2021) fixes the issue.
I think it has been fixed by #391 and #392, and the fix is only available in v1.3.3 or in versions after v2.0.0.a0. So it will not work in v1.3.1.
In recent versions of deepmd-kit (from dp1.3.1 to the most recent master version), there might be a bug in the GPU version that causes an illegal memory access. It only occurs when a compute that evaluates the potential energy (or any compute that relies on the potential energy, such as stress) is invoked together with a fix npt command. An example is given below:
fix 2 mobile npt temp ${initTemp} ${initTemp} 0.1 x 1 1 2 y 1 1 2 z 1 1 2
compute potential all pe/atom
thermo 100
thermo_style custom step pe ke temp pxx pyy pzz pxy pyz pxz vol
dump 2 all custom 1000 ${file}-equal id type x y z c_potential
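To check whether an existing input script hits the combination described above (a potential-energy compute used together with fix npt), a rough scan like the one below might help. This is only a sketch: `find_risky_combination` and `PE_COMPUTE_STYLES` are hypothetical names, not part of deepmd-kit or LAMMPS, and the scan assumes one command per line with no `&` continuations. Other compute styles that depend on the potential energy would have to be added to the set by hand.

```python
# Hypothetical helper, not part of deepmd-kit or LAMMPS.
# Assumption: one command per line, no "&" line continuations.
PE_COMPUTE_STYLES = {"pe", "pe/atom"}  # compute styles that need the potential energy


def find_risky_combination(script: str):
    """Return (npt_fix_ids, pe_compute_ids) found in a LAMMPS input script."""
    npt_fixes, pe_computes = [], []
    for raw in script.splitlines():
        words = raw.split("#", 1)[0].split()  # drop trailing comments, then tokenize
        if len(words) < 4:
            continue  # "fix" and "compute" both need: command ID group style
        command, ident, _group, style = words[:4]
        if command == "fix" and style == "npt":
            npt_fixes.append(ident)
        elif command == "compute" and style in PE_COMPUTE_STYLES:
            pe_computes.append(ident)
    return npt_fixes, pe_computes


example = """\
fix 2 mobile npt temp ${initTemp} ${initTemp} 0.1 x 1 1 2 y 1 1 2 z 1 1 2
compute potential all pe/atom
thermo 100
"""

npt_fixes, pe_computes = find_risky_combination(example)
if npt_fixes and pe_computes:
    print("fix npt", npt_fixes, "is used together with pe computes", pe_computes)
```

If both lists are non-empty, the script matches the pattern reported here. It does not tell you whether your deepmd-kit build is actually affected.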
I guess this is because both the compute and the fix npt commands call the GPU to calculate the potential energy from the deep neural network potential. The available memory is presumably all allocated for the fix npt command, so the access to GPU memory from the compute command is treated as illegal.
The full description of the error is:
2021-04-05 12:21:07.174540: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-04-05 12:21:07.174566: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
[gadi-gpu-v100-0128:925339] Process received signal
[gadi-gpu-v100-0128:925339] Signal: Aborted (6)
[gadi-gpu-v100-0128:925339] Signal code: (-6)
[gadi-gpu-v100-0128:925339] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x14a2fec2cb20]
[gadi-gpu-v100-0128:925339] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x14a2fe88e7ff]
[gadi-gpu-v100-0128:925339] [ 2] /lib64/libc.so.6(abort+0x127)[0x14a2fe878c35]
[gadi-gpu-v100-0128:925339] [ 3] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_cc.so.2(+0xc94a2b7)[0x14a30e4142b7]
[gadi-gpu-v100-0128:925339] [ 4] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_cc.so.2(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl14lts_2020_02_2513InlinedVectorINS0_5InUseELm4ESaIS4_EEE+0x161)[0x14a30de4bf81]
[gadi-gpu-v100-0128:925339] [ 5] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_cc.so.2(_ZN10tensorflow8EventMgr8PollLoopEv+0xa4)[0x14a30de4c374]
[gadi-gpu-v100-0128:925339] [ 6] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x4b1)[0x14a300d9bb71]
[gadi-gpu-v100-0128:925339] [ 7] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_framework.so.2(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x43)[0x14a300d99263]
[gadi-gpu-v100-0128:925339] [ 8] /scratch/qf9/yxz565/softwares/tensorflow2.3.0_root/lib/libtensorflow_framework.so.2(+0x1103547)[0x14a300d8a547]
[gadi-gpu-v100-0128:925339] [ 9] /lib64/libpthread.so.0(+0x814a)[0x14a2fec2214a]
[gadi-gpu-v100-0128:925339] [10] /lib64/libc.so.6(clone+0x43)[0x14a2fe953f23]
[gadi-gpu-v100-0128:925339] End of error message