I am running a system of 32 water molecules with "ks_solver" set to "cusolver", using a single GPU.
I find that with multiprocessing, i.e. running ABACUS with "OMP_NUM_THREADS=1 mpirun -n 12 abacus", the total time for 10 MD steps is 1862 s. With multithreading, i.e. running ABACUS with "OMP_NUM_THREADS=12 mpirun -n 1 abacus", the total time for the same 10 MD steps is 5920 s. The latter is much slower, by a factor of about 3.2!
Examples and the corresponding results are provided here: cusolver_mpi_openmp.zip
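For anyone trying to reproduce this, the two launch configurations can be timed side by side with a small script. This is only a sketch, not the script used for the numbers above: it assumes `abacus` and `mpirun` are on the PATH and that the input files from the attached archive are in the current directory. The helper names `launch_cmd` and `time_run`, the `DRY_RUN` switch, and the log file names are made up for illustration. Note that the standard OpenMP environment variable is spelled `OMP_NUM_THREADS`.

```shell
#!/usr/bin/env bash
# Sketch: compare 12 MPI ranks x 1 thread against 1 MPI rank x 12 threads.
# Assumption: abacus and mpirun are on PATH, inputs are in the current directory.

TOTAL_CORES=12   # total CPU cores used in both runs of the report

# Build the launch command string for a given number of MPI ranks;
# the remaining cores go to OpenMP threads (hypothetical helper).
launch_cmd() {
    local ranks=$1
    local threads=$(( TOTAL_CORES / ranks ))
    echo "OMP_NUM_THREADS=${threads} mpirun -n ${ranks} abacus"
}

# Run one configuration and print its wall time in seconds.
# With DRY_RUN=1 (the default here) the command is only printed.
time_run() {
    local ranks=$1
    local threads=$(( TOTAL_CORES / ranks ))
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "[dry run] $(launch_cmd "$ranks")"
        return 0
    fi
    local start=$SECONDS
    env OMP_NUM_THREADS="$threads" mpirun -n "$ranks" abacus > "log_mpi${ranks}.txt" 2>&1
    echo "$(launch_cmd "$ranks"): $(( SECONDS - start )) s"
}

time_run 12   # multiprocessing: 12 ranks, 1 thread each
time_run 1    # multithreading: 1 rank, 12 threads
```

Running it with `DRY_RUN=0` performs both calculations one after the other and writes each run's output to its own log file, so the two wall times can be compared directly.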
Task list for Issue attackers (only for developers)
[ ] Reproduce the performance issue on a similar system or environment.
[ ] Identify the specific section of the code causing the performance issue.
[ ] Investigate the issue and determine the root cause.
[ ] Research best practices and potential solutions for the identified performance issue.
[ ] Implement the chosen solution to address the performance issue.
[ ] Test the implemented solution to ensure it improves performance without introducing new issues.
[ ] Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
[ ] Review and incorporate any relevant feedback from users or developers.
[ ] Merge the improved solution into the main codebase and notify the issue reporter.