deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

Segmentation fault (core dumped) when using compute function of NNPInter.h #574

Closed Cloudac7 closed 3 years ago

Cloudac7 commented 3 years ago

Dear all, I'm trying to write a C++ interface that calculates the energy and forces of a DeePMD potential from given structure information. According to NNPInter.h, I should be able to pass the values to the compute function to get the energy and forces.

  void compute (ENERGYTYPE &            ener,
        vector<VALUETYPE> &     force,
        vector<VALUETYPE> &     virial,
        vector<VALUETYPE> &     atom_energy,
        vector<VALUETYPE> &     atom_virial,
        const vector<VALUETYPE> &   coord,
        const vector<int> &     atype,
        const vector<VALUETYPE> &   box,
        const vector<VALUETYPE> &   fparam = vector<VALUETYPE>(),
        const vector<VALUETYPE> &   aparam = vector<VALUETYPE>());

However, when I execute the compute function from NNPInter.h on a GPU, it raises the error shown below:

2021-04-28 17:27:07.360527: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-04-28 17:27:07.437428: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-28 17:27:07.441541: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-04-28 17:27:07.442388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-04-28 17:27:07.442490: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-04-28 17:27:07.445258: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-04-28 17:27:07.447935: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-04-28 17:27:07.448858: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-04-28 17:27:07.451637: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-04-28 17:27:07.453269: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-04-28 17:27:07.459345: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-04-28 17:27:07.460596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-04-28 17:27:07.460629: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-04-28 17:27:08.326936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-28 17:27:08.326977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2021-04-28 17:27:08.326993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2021-04-28 17:27:08.328718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 29259 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0)
2021-04-28 17:27:08.849037: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
cuda assert: misaligned address /data/share/soft/deepmd-kit/source/op/prod_virial_se_a_gpu.cc 88
2021-04-28 17:27:09.332565: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS: misaligned address
2021-04-28 17:27:09.332653: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Segmentation fault (core dumped)

As shown, a CUDA_ERROR_MISALIGNED_ADDRESS error is raised. I printed every variable passed to the function, and the coordinates and box information are correct. Debugging with cuda-gdb gives the following backtrace:

CUDA Exception: Warp Misaligned Address
The exception was triggered at PC 0x2aac30047090

Thread 60 "call" received signal CUDA_EXCEPTION_6, Warp Misaligned Address.
[Switching focus to CUDA kernel 0, grid 4, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x00002aac300470a0 in get_i_idx_se_a(int, int const*, int*)<<<(1,1,1),(256,1,1)>>> ()

So where might the issue come from, and how could I fix it? Thanks!
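
The manual check described above (printing the coordinates and box before the call) can also be done programmatically. The following is a minimal sketch, not part of DeePMD-kit: check_compute_inputs is a hypothetical helper, and it assumes the flat layout in which coord holds 3 values per atom and box holds the 9 entries of a 3x3 cell.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a DeePMD-kit function): returns an empty string
// when the flat input arrays are mutually consistent, otherwise a short
// description of the first problem found.
// Assumption: coord stores 3 values per atom, box stores a 3x3 cell (9 values).
template <typename VALUETYPE>
std::string check_compute_inputs(const std::vector<VALUETYPE>& coord,
                                 const std::vector<int>& atype,
                                 const std::vector<VALUETYPE>& box) {
  if (coord.size() != 3 * atype.size()) {
    return "coord.size() must equal 3 * atype.size()";
  }
  if (box.size() != 9) {
    return "box must have 9 entries";
  }
  for (std::size_t i = 0; i < atype.size(); ++i) {
    if (atype[i] < 0) {
      return "atom type at index " + std::to_string(i) + " is negative";
    }
  }
  return "";
}
```

Calling it right before compute and stopping when the returned string is non-empty rules out malformed inputs. It cannot detect a mismatch between library or interface versions.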

amcadmus commented 3 years ago

Please report (1) the version or commit hash of your source code, and (2) how you used the C++ interface, in detail. A minimal piece of code that reproduces your issue is preferred. Thank you.

Cloudac7 commented 3 years ago

  1. I used DeePMD-kit v1.2.4 (with TensorFlow C++ Interface 2.3.0). The same code runs with versions lower than v1.1 (with TensorFlow C++ Interface <= 1.13.2).
  2. Please refer to here for the whole code. For a simple test, the C wrapper can be compiled.

denghuilu commented 3 years ago

Thanks for your reply, we will try to solve this problem as soon as possible.

amcadmus commented 3 years ago

The bug is due to the use of an outdated C++ interface. With the latest C++ interface provided by v2.0.0.b1, the code works well.

Cloudac7 commented 3 years ago

Thanks, I will try the new version later.