Closed tkoskela closed 7 months ago
Things to look out for
An initial profiling result with the test case described above, using the current `develop` branch with `MULT_KERN = ompGemm`: most of the time is spent in the multiply kernels (`m_kern_min` and `m_kern_max`) :smiley:, but a significant fraction goes to OpenMP overhead (`__kmp_fork_barrier` and `__kmpc_barrier`) :frowning_face:

My first approach to reducing the inefficiency would be to move the threading to the main loop and wrap the MPI communications in `!$omp master` regions (sufficient if we request `MPI_THREAD_FUNNELED` in `MPI_Init_thread`; `!$omp critical` would instead require at least `MPI_THREAD_SERIALIZED`).
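As a rough sketch of that pattern (the loop bounds and subroutine names here are hypothetical placeholders, not the actual CONQUEST code), the communication for each partition would run on the master thread inside one long-lived parallel region, with a barrier before the work is distributed:

```fortran
! Sketch only: fetch_remote_data, m_kern_max and the loop bounds are
! illustrative placeholders, not the real multiply_module interfaces.
!$omp parallel default(shared) private(kpart, k)
do kpart = 1, npartitions          ! main loop, replicated on every thread
   !$omp master
   call fetch_remote_data(kpart)   ! MPI calls on the master thread only
   !$omp end master
   !$omp barrier                   ! wait until the data has arrived
   !$omp do schedule(dynamic)
   do k = 1, nblocks(kpart)
      call m_kern_max(kpart, k)    ! threaded compute kernel
   end do
   !$omp end do                    ! implicit barrier before the next fetch
end do
!$omp end parallel
```

Note the explicit `!$omp barrier` after the master region: without it, worker threads could enter the `!$omp do` before the MPI transfers have completed.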
It should be possible to open the parallel region in
https://github.com/OrderN/CONQUEST-release/blob/6bf8f4a8c20fd4fa8f1c7baeb8a6b1f23a6d2408/src/multiply_module.f90#L226
and keep the `!$omp do` worksharing constructs as orphaned constructs where they are in the `multiply_kernel`.
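"Orphaned" here means the `!$omp do` stays in the kernel subroutine while the enclosing `!$omp parallel` is opened by the caller. A minimal sketch with illustrative names (not the real kernel interface):

```fortran
! Sketch: the orphaned construct binds to whatever parallel region is
! active in the caller, and simply runs sequentially if there is none.
subroutine kernel_sketch(n, a, b, c)
  integer, intent(in)    :: n
  real(8), intent(in)    :: a(n), b(n)
  real(8), intent(inout) :: c(n)
  integer :: i
  !$omp do
  do i = 1, n
     c(i) = c(i) + a(i) * b(i)
  end do
  !$omp end do
end subroutine kernel_sketch
```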
We've tried to implement this in `tk-optimise-multiply`. Results from the `matrix_multiply` benchmark on 8 ranks / 4 threads: best performance with `ompGemm_m` and `ompDoik`, roughly a 2x speedup with 4 threads compared to the serial version.
In `tk-optimise-multiply` -> #266 we moved the creation of the OMP parallel region out of the multiply kernel, outside the main loop in `multiply_module`, and wrapped the MPI communications in `!$omp master`. To do that, we had to introduce barriers around the MPI communication to ensure data has arrived before distributing work to the compute threads; this was previously guaranteed because the communication was done outside the parallel region.

We benchmarked `DM.L_range` from 16 to 20 in the `matrix_multiply` benchmark, using the `ompGemm` kernel with the previous `develop` branch and with the `tk-optimise-multiply` branch. The overhead from forking threads is reduced by ~30%. Unfortunately, this saving is replaced by time spent in the barriers we had to introduce to avoid race conditions. Next we need to get rid of the OMP barriers by overlapping communication with computation. This is addressed in #265.
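One possible shape for that overlap (a double-buffering sketch with hypothetical names; #265 may do it differently) is to post a non-blocking receive for partition `kpart+1` while the threads compute on the buffer already filled for `kpart`, so the remaining barrier only synchronises threads instead of waiting on communication:

```fortran
! Sketch, assumed inside an existing !$omp parallel region with
! cur, nxt, tmp, kpart and k private; start_recv and compute_block
! are illustrative placeholders around MPI_Irecv-style calls.
cur = 1; nxt = 2
!$omp master
call start_recv(1, buffer(:, cur), req)      ! prefetch first partition
!$omp end master
do kpart = 1, npartitions
   !$omp master
   call MPI_Wait(req, status, ierr)          ! data for kpart is now here
   if (kpart < npartitions) call start_recv(kpart + 1, buffer(:, nxt), req)
   !$omp end master
   !$omp barrier
   !$omp do
   do k = 1, nblocks(kpart)
      call compute_block(buffer(:, cur), k)  ! overlaps the next receive
   end do
   !$omp end do
   tmp = cur; cur = nxt; nxt = tmp           ! swap buffers (private copies)
end do
```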
Once we have closed #195 and #244, we can look into the performance of these threading improvements together with the previously threaded matrix multiply kernels.
The multiply kernel can be selected with the `MULT_KERN` option in the Makefile. The best place to start is `ompGemm`, but it is worth looking at the other options too.

A good test case is:
- Use `Si.ion` from test 002 in the testsuite
- Use `Conquest_input` from test 002 in the testsuite, change Grid cutoff to 200
- Use `Coords.dat` from the input used in #195 --> This is the `matrix_multiply` performance test in #262

- [x] #268
- [x] Think about strategies for reducing OMP overhead
- [x] #269