Closed tkoskela closed 7 months ago
Things to look out for
An initial profiling result with the test case described above, using the current `develop` branch with `MULT_KERN = ompGemm`: most of the time is spent in the multiply kernels (`m_kern_min` and `m_kern_max`) :smiley:, but a significant fraction goes to OpenMP overhead (`__kmp_fork_barrier` and `__kmpc_barrier`) :frowning_face:

My first approach to reducing the inefficiency would be to move the threading to the main loop and wrap the MPI communications in `!$omp master` regions (sufficient if we request `MPI_THREAD_FUNNELED` in `MPI_Init_thread`; `!$omp critical` would instead require at least `MPI_THREAD_SERIALIZED`).
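As a rough sketch of that pattern (the loop bounds and subroutine names here are hypothetical placeholders, not the actual CONQUEST code), the communication for each partition would run on the master thread inside one long-lived parallel region, with a barrier before the work is distributed:

```fortran
! Sketch only: fetch_remote_data, m_kern_max and the loop bounds are
! illustrative placeholders, not the real multiply_module interfaces.
!$omp parallel default(shared) private(kpart, k)
do kpart = 1, npartitions          ! main loop, replicated on every thread
   !$omp master
   call fetch_remote_data(kpart)   ! MPI calls on the master thread only
   !$omp end master
   !$omp barrier                   ! wait until the data has arrived
   !$omp do schedule(dynamic)
   do k = 1, nblocks(kpart)
      call m_kern_max(kpart, k)    ! threaded compute kernel
   end do
   !$omp end do                    ! implicit barrier before the next fetch
end do
!$omp end parallel
```

Note the explicit `!$omp barrier` after the master region: without it, worker threads could enter the `!$omp do` before the MPI transfers have completed.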
It should be possible to open the parallel region in
https://github.com/OrderN/CONQUEST-release/blob/6bf8f4a8c20fd4fa8f1c7baeb8a6b1f23a6d2408/src/multiply_module.f90#L226
and keep the `!$omp do` worksharing constructs as orphaned constructs where they are in the `multiply_kernel`.
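"Orphaned" here means the `!$omp do` stays in the kernel subroutine while the enclosing `!$omp parallel` is opened by the caller. A minimal sketch with illustrative names (not the real kernel interface):

```fortran
! Sketch: the orphaned construct binds to whatever parallel region is
! active in the caller, and simply runs sequentially if there is none.
subroutine kernel_sketch(n, a, b, c)
  integer, intent(in)    :: n
  real(8), intent(in)    :: a(n), b(n)
  real(8), intent(inout) :: c(n)
  integer :: i
  !$omp do
  do i = 1, n
     c(i) = c(i) + a(i) * b(i)
  end do
  !$omp end do
end subroutine kernel_sketch
```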
We've tried to implement this in `tk-optimise-multiply`. Results from the `matrix_multiply` benchmark on 8 ranks / 4 threads: best performance with `ompGemm_m` and `ompDoik`, roughly a 2x speedup with 4 threads compared to the serial version.
In `tk-optimise-multiply` -> #266 we moved the creation of the OMP parallel region out of the multiply kernel, outside the main loop in `multiply_module`, and wrapped the MPI communications in `!$omp master`. To do that, we had to introduce barriers around the MPI communication to ensure data has arrived before distributing work to the compute threads; this was previously guaranteed because the communication was done outside the parallel region.

We benchmarked `DM.L_range` from 16 to 20 in the `matrix_multiply` benchmark, using the `ompGemm` kernel with the previous `develop` branch and with the `tk-optimise-multiply` branch. The overhead from forking threads is reduced by ~30%. Unfortunately, this saving is replaced by time spent in the barriers we had to introduce to avoid race conditions. Next we need to get rid of the OMP barriers by overlapping communication with computation. This is addressed in #265.
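One possible shape for that overlap (a double-buffering sketch with hypothetical names; #265 may do it differently) is to post a non-blocking receive for partition `kpart+1` while the threads compute on the buffer already filled for `kpart`, so the remaining barrier only synchronises threads instead of waiting on communication:

```fortran
! Sketch, assumed inside an existing !$omp parallel region with
! cur, nxt, tmp, kpart and k private; start_recv and compute_block
! are illustrative placeholders around MPI_Irecv-style calls.
cur = 1; nxt = 2
!$omp master
call start_recv(1, buffer(:, cur), req)      ! prefetch first partition
!$omp end master
do kpart = 1, npartitions
   !$omp master
   call MPI_Wait(req, status, ierr)          ! data for kpart is now here
   if (kpart < npartitions) call start_recv(kpart + 1, buffer(:, nxt), req)
   !$omp end master
   !$omp barrier
   !$omp do
   do k = 1, nblocks(kpart)
      call compute_block(buffer(:, cur), k)  ! overlaps the next receive
   end do
   !$omp end do
   tmp = cur; cur = nxt; nxt = tmp           ! swap buffers (private copies)
end do
```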
Once we have closed #195 and #244, we can look into the performance of these threading improvements together with the previously threaded matrix multiply kernels.
The multiply kernel can be selected with the `MULT_KERN` option in the Makefile. The best place to start is `ompGemm`, but it is worth looking at the other options too.

A good test case is:
- Use `Si.ion` from test 002 in the testsuite
- Use `Conquest_input` from test 002 in the testsuite, change Grid cutoff to 200
- Use `Coords.dat` from the input used in #195 --> This is the `matrix_multiply` performance test in #262

- [x] #268
- [x] Think about strategies for reducing OMP overhead
- [x] #269