[BUG] Nbor list sorting error in lammps with the compressed model

zezhong-zhang commented 3 years ago

Summary

Using the compressed model in LAMMPS with multiple GPUs leads to an "illegal nbor list sorting" error. A single GPU does not have this issue.

Deepmd-kit version, installation way, input file, running commands, error log, etc.

System: CentOS Linux 7 (Core) with slurm
deepmd-kit: 2.0.0.b0 py39_0_cuda10.1_gpu deepmodeling/label/dev
lammps-dp: 2.0.0.b0 0_cuda10.1_gpu deepmodeling/label/dev
python: 3.9.4 hdb3f193_0
installation: conda 4.10.1
command: srun -n 16 lmp -in in.lammps

Input & output file including:
in.lammps
graph.pb (model not compressed)
graph-compress.pb (after compression)
log for single GPU
log for multiple GPU with srun
log for multiple GPU with mpirun
the model training parameters
g6_sub.lammps -- this is a small test structure
hex_loop_2_new.lammps -- this is a large structure

Archive.zip

Steps to Reproduce

srun -n 16 lmp -in in.lammps with the compressed model yields "illegal nbor list sorting", as does mpirun.
lmp -in in.lammps with the compressed model and a single GPU runs.
srun -n 16 lmp -in in.lammps with the uncompressed model and multiple GPUs also runs.
In all cases, however, the output (both MC and MD) is not updated in the log, although dump is working.

Further Information, Files, and Links

For the large structure, I have 58673 atoms in the box and run with 16 V100 GPUs. Running with the uncompressed model gives a CUDA out of memory error. What would be a good estimate of the number of GPUs needed for a given number of atoms?

denghuilu commented 3 years ago

Could you provide the training data mentioned in input.json? We want to try different model compression parameters.

denghuilu commented 3 years ago

Actually, a similar problem can also be found in the original model. As a quick fix, we suggest setting the neighbor list update frequency (nlist freq) to 1, which avoids the problem in the original model. We are working on a fix for model compression and will release it as soon as possible.
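
For readers applying the quick fix, here is a minimal sketch of what setting "nlist freq" to 1 might look like in in.lammps. It assumes the suggestion refers to the standard LAMMPS neigh_modify command. The delay and check values are an assumption added so that the list is rebuilt on every step; they are not settings taken from this thread.

```
# Assumption: "nlist freq" maps to the neigh_modify command in LAMMPS.
# every 1  -> consider rebuilding the neighbor list on every step
# delay 0  -> do not postpone rebuilds after the previous one
# check no -> rebuild unconditionally instead of only when atoms have moved far enough
neigh_modify every 1 delay 0 check no
```

Rebuilding the neighbor list on every step adds cost, so this is a workaround until the fix is available and not a recommended permanent setting.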

dfz05 commented 3 years ago

I ran into the same problem during an MC simulation with the uncompressed model. When running MD only, the simulation is fine. However, when running MC+MD, LAMMPS fails randomly and reports the error "illegal nbor list sorting". The code I used is Deepmd-kit standalone 2.0.0.beta0.

njzjz commented 3 years ago

Fixed in #812.

deepmodeling / deepmd-kit

[BUG] Nbor list sorting error in lammps with the compressed model #773