Closed initqp closed 1 year ago
It might be possible; I will give it a try. You are also welcome to try it out if you know how it should be done. I have very limited knowledge about OpenMP.
Thanks! However, since I'm not good at parallel programming, maybe I should wait for you or other experts to do it.
OK, I have played with OpenMP a little before, and I think it is not trivial to get a speedup whenever there are scatter operations, such as accumulating the force/virial of a neighbor atom $j$ inside the loop over the central atom $i$.
Could you try the version in the branch of this PR: https://github.com/brucefan1983/NEP_CPU/pull/18. It should speed up by about 2X.
You are also welcome to comment on the PR.
OK, I will give the new branch a try.

Results: I can confirm a solid 1.75x speedup with the 250-atom PbTe system. Huge improvement!
Thanks for the quick tests. Note that the current interface for the non-LAMMPS part (which should be the one you used) does not accept a neighbor list; it calculates the neighbor list every time, so it is not suitable for large systems. Also, this interface assumes that the box is periodic in all directions, as you may have noticed. Do you think these should be improved? I think the neighbor list is now the bottleneck!
For the 250-atom PbTe case, with the main branch, I got

Computational speed = 23412.6 atom-step/second.
Computational cost = 0.042712 mini-second/atom-step.

With the branch of this PR, I got

Computational speed = 46895.5 atom-step/second.
Computational cost = 0.021324 mini-second/atom-step.

Calculating the energy/force/virial 1000 times, I got

Computational speed = 74338.4 atom-step/second.
Computational cost = 0.013452 mini-second/atom-step.
… the NEP_CPU repo, because I suppose it is either used for active-learning purposes or used for LAMMPS, which provides a neighbor list.

Thanks for the detailed testing. In general, I have the following viewpoints, although some of them may be unfair:
I fully agree with you on the three points. Currently only calorine, pynep, and somd have used this interface, and I think you all noticed the assumption about the periodic boundary conditions (as there is no input for this).
I will try OpenMP at some point and keep you informed if there is any progress.
I have done the easy parts: find_neighbor_list_small_box() and find_descriptor_small_box() are suitable for applying an easy omp parallel for.

For the remaining functions, one would first need to remove the scatter operations before applying omp parallel for, similar to the GPUMD implementation for the large-box case, where CUDA atomic functions are avoided. As I have not devised the algorithm in NEP_CPU in that way from the start, it would be too involved to make the refactoring now.

That's amazing; it seems that the main branch is much faster now! Since the remaining OpenMP work would require relatively large refactoring (or you may want to invoke critical sections, which might be slow), I think this could be a less urgent, long-term task. So feel free to close this issue if it hangs for too long.
OK, I think it can be closed. The refactoring is worth doing, but I will leave it for the future.
Is it possible to enable OpenMP parallel computing in NEP_CPU? This may speed up the computation in some use cases. Thanks in advance!