Closed by jinfeng-data 3 years ago.
I remember this package was packaged by @felix5572.
Yes, it looks like there are some problems when compiling LAMMPS + deepmd-kit for CUDA 11.1 + TF 2.4.
The deepmd-kit Python interface itself seems to work well:
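import deepmd.DeepPot as DP
import numpy as np
dp = DP('graph.pb')  # load the frozen model
# one frame of 3 atoms, flattened to shape (1, 9)
coord = np.array([[1,0,0], [0,0,1.5], [1,0,3]]).reshape([1, -1])
# 10 x 10 x 10 cubic box, flattened to shape (1, 9)
cell = np.diag(10 * np.ones(3)).reshape([1, -1])
atype = [1,0,1]  # type index of each atom
e, f, v = dp.eval(coord, cell, atype)  # energy, force, virial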
But LAMMPS does not return the correct energy.
I cannot figure out where it goes wrong.
Does it work on cards other than the RTX 3080?
Since we only bought RTX 3080 cards, I have not tried this version of deepmd-kit on any other GPU. Could you please fix this problem? Thanks very much!
I don't have an RTX 3080 card to test with, but @felix5572, did you compile TensorFlow and deepmd-kit with compute capability 8.6?
@jinfeng-data Your example runs on my development computer (CPU only). Could you please try running it on CPU only? @njzjz I don't have an RTX 3080 card either, and I compiled the package on a CPU-only machine. I can check whether it was compiled with compute capability 8.6.
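One common way to do the CPU-only check suggested above is to hide all GPUs from CUDA before TensorFlow starts. The helper below is a hypothetical minimal sketch (the name force_cpu_only is made up for illustration). It relies on the standard CUDA behavior that an empty CUDA_VISIBLE_DEVICES exposes no devices, and it only takes effect if it runs before deepmd/TensorFlow is imported and initializes CUDA.

```python
import os


def force_cpu_only() -> None:
    """Hide every CUDA device from this process (hypothetical helper).

    Must be called before TensorFlow (and therefore deepmd) initializes
    CUDA, i.e. before `import deepmd`; once CUDA is initialized, changing
    the variable typically no longer affects the running process.
    """
    # An empty device list means CUDA enumerates no GPUs,
    # so TensorFlow falls back to its CPU kernels.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

In the Python example, call force_cpu_only() before `import deepmd.DeepPot as DP`. For a LAMMPS run, the same variable can instead be set in the shell that launches lmp, since child processes inherit it.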
I cannot reproduce the bug using the v2.0.0.b0 LAMMPS built with CUDA 11.1 on an RTX 3090 card.
I just downloaded the package. TF_CUDA_COMPUTE_CAPABILITIES was not set when compiling the TensorFlow C++ interface.
This is not a bug in deepmd-kit but a TensorFlow compilation issue. I will close this issue, as we have already provided a correctly compiled TensorFlow in the official channel.
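To check whether a given TF_CUDA_COMPUTE_CAPABILITIES value explicitly targets the RTX 3080/3090 (compute capability 8.6), a small helper can parse the setting. This is a hypothetical sketch: the function names are invented, and it assumes the plain comma-separated "X.Y" format (for example "7.5,8.6"). It is deliberately conservative and ignores CUDA's forward-compatibility paths (same-major binary compatibility and PTX JIT compilation), so False means "not explicitly listed" rather than "will not run".

```python
from typing import List, Tuple


def parse_capabilities(spec: str) -> List[Tuple[int, int]]:
    """Parse a comma-separated list such as "7.5,8.6" into (major, minor) tuples."""
    caps = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        major, minor = item.split(".")
        caps.append((int(major), int(minor)))
    return caps


def targets_device(spec: str, device_cap: Tuple[int, int]) -> bool:
    """Return True only if device_cap appears explicitly in spec.

    Conservative on purpose: CUDA compatibility rules that might still let
    the build run on device_cap are not modeled here.
    """
    return device_cap in parse_capabilities(spec)
```

For example, targets_device("7.5,8.6", (8, 6)) is True, while a spec that stops at "7.0" is flagged as not explicitly targeting the card.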
Summary
The LAMMPS MD output is abnormal when using a model trained with deepmd-kit-1.3.1-cuda11.1_gpu-Linux-x86_64.
Attachment: input-output.zip
Since my new RTX 3080 GPU only supports CUDA 11.0 or later, I downloaded and installed deepmd-kit-1.3.1-cuda11.1_gpu-Linux-x86_64.sh on the new machine. I trained a model for an ion-water system for 5,000,000 batches to obtain a fully converged PES. The training process looked normal: I checked the loss function and the RMS errors of the energy and force. However, when I ran LAMMPS MD with the frozen model, the energy, temperature, and all other thermodynamic outputs were exactly the same at every step, as shown below:
    Per MPI rank memory allocation (min/avg/max) = 4.415 | 4.415 | 4.415 Mbytes
    Step     PotEng    KinEng     TotEng  Temp     Press    Volume
       0 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
     100 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
     200 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
     300 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
     400 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
     500 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
     600 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
     700 -353944.84 15.472439 -353929.37   300 4412.6563 3745.2239
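The symptom here is that every thermodynamic column stays identical from step to step. A minimal sketch for flagging this automatically, assuming the thermo block has been copied out starting at the "Step ..." header line (the function names are hypothetical): it parses the columns and reports the run as frozen when PotEng, KinEng, TotEng, Temp and Press never change.

```python
from typing import Dict, List, Sequence

# Columns that should fluctuate during a normal MD run. Volume is left out
# because it is legitimately constant in NVT runs; Step always increases.
DYNAMIC_KEYS = ("PotEng", "KinEng", "TotEng", "Temp", "Press")


def parse_thermo(text: str) -> Dict[str, List[float]]:
    """Parse a LAMMPS thermo block (header line + numeric rows) into columns."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    columns: Dict[str, List[float]] = {name: [] for name in header}
    for row in rows:
        try:
            values = [float(v) for v in row]
        except ValueError:
            break  # stop at the first non-numeric line, e.g. "Loop time of ..."
        if len(values) != len(header):
            break  # stop at a row that does not match the header width
        for name, value in zip(header, values):
            columns[name].append(value)
    return columns


def looks_frozen(columns: Dict[str, List[float]],
                 keys: Sequence[str] = DYNAMIC_KEYS) -> bool:
    """True if every dynamic column present holds one repeated value over 2+ rows."""
    present = [k for k in keys if k in columns]
    if not present or len(columns[present[0]]) < 2:
        return False  # not enough data to judge
    return all(len(set(columns[k])) == 1 for k in present)
```

Excluding Volume keeps the check meaningful for NVT runs, where a constant volume is expected and would otherwise look like a frozen column.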
However, when I used the same input data files and trained the model with another deepmd-kit version built with CUDA 10.0 on my old GPU, the LAMMPS MD simulation ran normally and all of the output was normal. Hence I am wondering whether deepmd-kit-1.3.1-cuda11.1_gpu-Linux-x86_64 supports CUDA 11.1 on the RTX 3080.
Deepmd-kit version, installation method, input file, running commands, error log, etc.
Steps to Reproduce
Further Information, Files, and Links