Closed initqp closed 1 year ago
It might be possible; I will give it a try. You are also welcome to try it out if you know how it should be done. I have very limited knowledge about OpenMP.
Thanks! However, since I'm not good at parallel programming, maybe I should wait for you or other experts to do it.
OK, I have played with OpenMP a little before, and I think it is not trivial to get a speedup whenever there are scatter operations, such as accumulating the force/virial of a neighbor atom $j$ inside the loop over the central atom $i$.
Could you try the version in the branch of this PR: https://github.com/brucefan1983/NEP_CPU/pull/18. It should speed up by about 2X.
You are also welcome to comment on the PR.
OK, I will give the new branch a try.

Results: I can confirm a solid 1.75x speedup with the 250-atom PbTe system. Huge improvement!
Thanks for the quick tests. Note that the current interface for the non-LAMMPS part (which should be the one you used) does not accept a neighbor list; it calculates the neighbor list every time, so it is not suitable for large systems. Also, this interface assumes that the box is periodic in all directions, as you may have noticed. Do you think these should be improved? I think the neighbor list is now the bottleneck!
For the 250-atom PbTe case, with the main branch, I got

Computational speed = 23412.6 atom-step/second.
Computational cost = 0.042712 mini-second/atom-step.

With the branch of this PR, I got

Computational speed = 46895.5 atom-step/second.
Computational cost = 0.021324 mini-second/atom-step.

Calculating the energy/force/virial 1000 times, I got

Computational speed = 74338.4 atom-step/second.
Computational cost = 0.013452 mini-second/atom-step.
… the NEP_CPU repo, because I suppose it is either used for active-learning purposes or used for LAMMPS, which provides a neighbor list.

Thanks for the detailed testing. In general, I have the following viewpoints, although some of them may be unfair:
I fully agree with you on the three points. Currently only calorine, pynep, and somd have used this interface, and I think you all noticed the assumption about the periodic boundary conditions (as there is no input for this).
I will try OpenMP at some point and keep you informed if there is any progress.
I have done the easy parts: find_neighbor_list_small_box() and find_descriptor_small_box() are suitable for applying an easy omp parallel for.

For the remaining functions, one would first need to remove the scatter operations before applying omp parallel for, similar to the GPUMD implementation for the large-box case, where CUDA atomic functions are avoided. As I have not devised the algorithm in NEP_CPU in that way from the start, it would be too involved to make the refactoring now.

That's amazing; it seems that the main branch is much faster now! Since the remaining OpenMP work would require relatively large refactoring (or you may want to invoke critical sections, which might be slow), I think this could be a less urgent, long-term task. So feel free to close this issue if it hangs for too long.
OK, I think it can be closed. The refactoring is worth doing, but I will leave it for the future.
Is it possible to enable OpenMP parallel computing in NEP_CPU? This may speed up the computation in some use cases. Thanks in advance!