[Feature Request] Poor performance of pppm/dplr for long-range interactions

cesaremalosso commented 3 months ago

Summary

kspace_style pppm/dplr is quite slow in LAMMPS and significantly slows down the MD simulation. A multiprocessing CPU implementation (or a GPU implementation) could speed up the simulation significantly.

Detailed Description

Hi, I'm running DPLR MD simulations with LAMMPS and I am seeing poor performance in the long-range part of the calculation. I'm running on 4 GPUs in a single node, using 1 MPI process per GPU. This is the performance report I get at the end of my simulation of 2727 atoms (and 909 Wannier centroids):

Performance: 0.156 ns/day, 153.955 hours/ns, 9.021 timesteps/s, 32.802 katom-step/s
100.7% CPU use with 4 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 13.74      | 14.367     | 15.688     |  20.6 |  4.89
Bond    | 0.0011576  | 0.0013793  | 0.0016772  |   0.5 |  0.00
Kspace  | 157.22     | 164.76     | 171.03     |  46.2 | 56.09
Neigh   | 2.6626     | 2.6628     | 2.6631     |   0.0 |  0.91
Comm    | 0.30548    | 0.34742    | 0.43454    |   8.8 |  0.12
Output  | 0.0019177  | 0.0024892  | 0.0027186  |   0.7 |  0.00
Modify  | 105.74     | 111.57     | 117.8      |  50.2 | 37.98
Other   |            | 0.03181    |            |       |  0.01

It seems that the kspace_style pppm/dplr, which accounts for the long-range interactions, is quite slow in LAMMPS and significantly slows down the MD simulation. Using more GPUs does not significantly increase the performance, since it improves only the Pair time.
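To put numbers on this, here is a minimal sketch that applies Amdahl's law to the "avg time" column of the breakdown above. The helper names (`fraction`, `amdahl_bound`) are made up for illustration, and the estimate assumes the sections do not overlap in time and that the average times are representative of every rank.

```python
# Average per-section wall times (seconds), copied from the "avg time" column above.
avg_time = {
    "Pair": 14.367,
    "Bond": 0.0013793,
    "Kspace": 164.76,
    "Neigh": 2.6628,
    "Comm": 0.34742,
    "Output": 0.0024892,
    "Modify": 111.57,
    "Other": 0.03181,
}

total = sum(avg_time.values())


def fraction(section):
    """Share of the total run time spent in one section."""
    return avg_time[section] / total


def amdahl_bound(section, factor=float("inf")):
    """Overall speedup if only `section` runs `factor` times faster (Amdahl's law).

    Assumption: the sections are strictly sequential and nothing else changes.
    """
    p = fraction(section)
    return 1.0 / ((1.0 - p) + p / factor)


for name in ("Pair", "Kspace", "Modify"):
    print(f"{name:7s} share = {100 * fraction(name):5.2f}%  "
          f"upper bound on speedup = {amdahl_bound(name):.2f}x")
```

Under these assumptions, even an infinitely fast Pair section would give only about a 1.05x overall speedup, whereas removing the Kspace cost entirely would give roughly 2.3x. This is consistent with the observation that adding GPUs barely helps.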

Do you think it would be beneficial to implement OpenMP thread parallelization to speed this part up? Perhaps one could use GPUs for both the short-range NNP and the Wannier NN, while running the particle-particle particle-mesh solver as multiple processes on multiple CPUs? Could a GPU pppm/dplr code also increase the performance?

Further Information, Files, and Links

No response

Yi-FanLi commented 3 months ago

This is a very good question! Actually, I did some benchmarks and discussed this with @amcadmus back in August last year. We do think that a GPU version of pppm/dplr would be beneficial. Similarly, I agree that thread parallelization might be useful. I am sorry that I have had only very limited time to work on this problem during the past year. I will try my best to figure it out. Do you have any suggestions, or have you done any tests on ways to accelerate the pppm?

cesaremalosso commented 2 months ago

Actually, I'm not very experienced with this kind of coding, so I would not be very helpful... I can do some testing if that would be useful!

deepmodeling / deepmd-kit