deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] "cuda assert: invalid argument" when running lmp #662

Closed tuoping closed 3 years ago

tuoping commented 3 years ago

Summary

On EHPC, running lmp in example/water/lmp fails with an error. The command was:

lmp < in.lammps > log

deepmd-kit was installed from the offline package deepmd-kit-2.0.0.a1-cuda10.1_gpu-Linux-x86_64.1.sh.

Here is the complete error message:

2021-05-25 08:42:24.368945: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-25 08:42:24.624171: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-25 08:42:24.624916: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-05-25 08:42:24.625117: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 08:42:24.626966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:08.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2021-05-25 08:42:24.627002: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-25 08:42:24.641634: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-05-25 08:42:24.649423: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-05-25 08:42:24.652484: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-05-25 08:42:24.658564: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-05-25 08:42:24.662662: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-05-25 08:42:24.672096: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-05-25 08:42:24.672476: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 08:42:24.674331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 08:42:24.676401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-05-25 08:42:24.676449: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-25 08:42:26.049835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-25 08:42:26.049884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-05-25 08:42:26.049901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-05-25 08:42:26.051038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 08:42:26.052941: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 08:42:26.054887: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 08:42:26.056778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14652 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2021-05-25 08:42:26.815483: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
cuda assert: invalid argument /home/conda/feedstock_root/build_artifacts/libdeepmd_1618987037209/work/source/lib/include/gpu_cuda.h 48

I checked and didn't find a "/home/conda" directory.
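A note on why that directory is missing: error-check macros of this kind usually record the source location with `__FILE__` and `__LINE__`, which the compiler fills in at build time. The path in the message would then be the path on the machine that built the package, so it need not exist where the binary runs. Below is a minimal sketch of the pattern, assuming a macro-based check. The names `FakeCudaError`, `fake_error_string`, `format_assert`, `check_impl`, and `CHECK_GPU` are hypothetical stand-ins chosen so the sketch builds without the CUDA toolkit; this is not the actual content of gpu_cuda.h.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for cudaError_t, so the sketch builds without CUDA.
enum class FakeCudaError { Success = 0, InvalidValue = 1 };

// Hypothetical stand-in for cudaGetErrorString().
inline const char* fake_error_string(FakeCudaError code) {
    switch (code) {
        case FakeCudaError::Success:      return "no error";
        case FakeCudaError::InvalidValue: return "invalid argument";
    }
    return "unknown error";
}

// Builds a diagnostic in the "<message> <file> <line>" shape seen in the log.
inline std::string format_assert(FakeCudaError code, const char* file, int line) {
    return std::string("cuda assert: ") + fake_error_string(code) + " " +
           file + " " + std::to_string(line);
}

// Returns true on success; otherwise prints the diagnostic to stderr.
inline bool check_impl(FakeCudaError code, const char* file, int line) {
    if (code == FakeCudaError::Success) return true;
    std::fprintf(stderr, "%s\n", format_assert(code, file, line).c_str());
    return false;
}

// __FILE__ and __LINE__ are expanded by the compiler where the macro is used,
// so the reported path is whatever path the source had on the build machine.
#define CHECK_GPU(call) check_impl((call), __FILE__, __LINE__)
```

With a check like this, the path and line number only identify which check failed in the source tree. They say nothing about the filesystem of the machine where lmp was run.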

Steps to Reproduce

Further Information, Files, and Links

amcadmus commented 3 years ago

deepmd-kit v2.0 has not been released! Which version of the source code are you using? Which binary did you use: did you compile it yourself, install it with conda, or use the offline package? How can one reproduce your issue?

amcadmus commented 3 years ago

should be the same issue as #533

njzjz commented 3 years ago

"should be the same issue as #533" (quoted from amcadmus)

If so, shouldn't it already have been fixed?

tuoping commented 3 years ago

It is fixed in the package deepmd-kit-2.0.0.b0-cuda10.1_gpu-Linux-x86_64.sh.