deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] deepmd-kit-1.3.1-cuda11.1_gpu-Linux-x86_64 does not work properly with cuda11.1 on gpu RTX3080 #587

Closed jinfeng-data closed 3 years ago

jinfeng-data commented 3 years ago

Summary

The LAMMPS MD output is abnormal when using a model trained with deepmd-kit-1.3.1-cuda11.1_gpu-Linux-x86_64. Attachment: input-output.zip

As my new RTX3080 GPU only supports CUDA 11.0 or later, I downloaded and installed deepmd-kit-1.3.1-cuda11.1_gpu-Linux-x86_64.sh on my new machine. I trained a model for the ion-water system for 5000000 batches to get a fully converged PES. The training process seemed normal, and I checked the loss function and the RMS errors of the energy and force. However, when I ran the LAMMPS MD simulation using the frozen model, the output energies, temperature, and other quantities were exactly the same at every step, as shown below:

    Per MPI rank memory allocation (min/avg/max) = 4.415 | 4.415 | 4.415 Mbytes
    Step  PotEng      KinEng     TotEng      Temp  Press      Volume
       0  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239
     100  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239
     200  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239
     300  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239
     400  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239
     500  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239
     600  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239
     700  -353944.84  15.472439  -353929.37  300   4412.6563  3745.2239

However, when I used the same input data files and trained the model with another deepmd-kit version built for CUDA 10.0 on my old GPU, the LAMMPS MD simulation ran normally and all of the output was normal. Hence I am wondering whether deepmd-kit-1.3.1-cuda11.1_gpu-Linux-x86_64 supports CUDA 11.1 on the RTX3080?
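The symptom reported above (every thermo row identical) can be detected automatically. The following is a minimal sketch, not part of deepmd-kit or LAMMPS: the helper names `parse_thermo` and `frozen_columns` are hypothetical, and the parser assumes the default one-line thermo style with a header row starting with `Step`.

```python
def parse_thermo(log_text):
    """Return (header, rows) for the last thermo block found in a LAMMPS log."""
    header, rows = None, []
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Step":
            # A header row starts a new thermo block.
            header, rows = fields, []
            continue
        if header is not None:
            if len(fields) != len(header):
                break  # a line such as "Loop time of ..." ends the block
            try:
                rows.append([float(x) for x in fields])
            except ValueError:
                break  # non-numeric line, treat it as the end of the block
    return header, rows


def frozen_columns(header, rows):
    """Names of thermo columns (other than Step) whose value never changes."""
    if header is None or len(rows) < 2:
        return []
    return [
        name
        for i, name in enumerate(header)
        if name != "Step" and all(r[i] == rows[0][i] for r in rows)
    ]


# Sample taken from the output quoted in this issue (first three steps).
sample = """\
Per MPI rank memory allocation (min/avg/max) = 4.415 | 4.415 | 4.415 Mbytes
Step PotEng KinEng TotEng Temp Press Volume
0 -353944.84 15.472439 -353929.37 300 4412.6563 3745.2239
100 -353944.84 15.472439 -353929.37 300 4412.6563 3745.2239
200 -353944.84 15.472439 -353929.37 300 4412.6563 3745.2239
"""

header, rows = parse_thermo(sample)
print(frozen_columns(header, rows))
```

A constant `Volume` is expected whenever the box is fixed, so it is not a warning sign by itself. `PotEng` and `KinEng` staying identical across hundreds of steps suggests that no forces are reaching the integrator.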

Deepmd-kit version, installation way, input file, running commands, error log, etc.

Steps to Reproduce

Further Information, Files, and Links

njzjz commented 3 years ago

I remember this package was packed by @felix5572.

felix5572 commented 3 years ago

Yes, it looks like there are some problems when we compile LAMMPS + deepmd-kit for CUDA 11.1 + TF 2.4.

deepmd-kit itself seems to work well when called from Python:

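import deepmd.DeepPot as DP
import numpy as np

# Load the frozen model
dp = DP('graph.pb')
# One frame with three atoms; coordinates flattened to shape (1, 9)
coord = np.array([[1,0,0], [0,0,1.5], [1,0,3]]).reshape([1, -1])
# Cubic cell with edge length 10, flattened to shape (1, 9)
cell = np.diag(10 * np.ones(3)).reshape([1, -1])
# Type index of each atom
atype = [1,0,1]
# Evaluate energy, forces, and virial
e, f, v = dp.eval(coord, cell, atype)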

But LAMMPS does not return the correct energy.

I cannot figure out where it goes wrong.

amcadmus commented 3 years ago

Does it work well on cards other than the RTX3080?

jinfeng-data commented 3 years ago

As we only bought RTX3080 cards, I have not tried this version of deepmd-kit on any other GPU. Could you please fix this problem? Thanks very much!

njzjz commented 3 years ago

I don't have an RTX3080 card to test with, but @felix5572, did you compile TensorFlow and deepmd-kit with compute capability 8.6?

felix5572 commented 3 years ago

@jinfeng-data Your example runs on my development computer (CPU only). Could you please try running it on CPU only? @njzjz I don't have an RTX3080 card either, and I compiled the package on a CPU-only machine. I can check whether it was compiled with compute capability 8.6.


njzjz commented 3 years ago

I cannot reproduce the bug using the v2.0.0.b0 LAMMPS built with cuda11.1 on a 3090 card.

njzjz commented 3 years ago

I just downloaded the package. TF_CUDA_COMPUTE_CAPABILITIES was not set when the TensorFlow C++ interface was compiled.

This is not a bug in deepmd-kit but a TensorFlow compilation issue. I will close this issue, as we already provide a correctly built TensorFlow in the official channel.
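As a rough illustration of the diagnosis above, the sketch below checks whether a build-time capability list names the compute capability discussed in this thread (8.6). The helpers `parse_capabilities` and `covers` are hypothetical, the parser assumes the plain comma-separated `major.minor` form only, and the check is a deliberate simplification: it ignores PTX forward compatibility, so a missing entry does not prove a build will fail.

```python
import os


def parse_capabilities(value):
    """Split a list such as "7.5,8.6" into (major, minor) tuples."""
    caps = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        major, minor = item.split(".")
        caps.append((int(major), int(minor)))
    return caps


def covers(build_caps, device_cap):
    """Simplified check: True only if the device capability was listed explicitly."""
    return device_cap in build_caps


# Compute capability mentioned in this thread for the RTX3080.
RTX3080_CAP = (8, 6)

# Read the variable from the build environment; empty if it was never set.
build_value = os.environ.get("TF_CUDA_COMPUTE_CAPABILITIES", "")
build_caps = parse_capabilities(build_value)

if not build_caps:
    print("TF_CUDA_COMPUTE_CAPABILITIES is not set; the build falls back to defaults.")
elif covers(build_caps, RTX3080_CAP):
    print("Compute capability 8.6 is listed explicitly.")
else:
    print("Compute capability 8.6 is not listed:", build_value)
```

Running a check like this in the build environment before compiling the TensorFlow C++ interface would flag an unset variable early, which is the condition identified in this issue.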