OrderN / CONQUEST-release

Full public release of large scale and linear scaling DFT code CONQUEST
http://www.order-n.org/
MIT License

Investigate performance of threaded matrix multiply kernel #248

Closed (tkoskela closed this issue 7 months ago)

tkoskela commented 8 months ago

Once we have closed #195 and #244, we can look into the performance of these threading improvements together with the previously threaded matrix multiply kernels.

The multiply kernel is selected with the MULT_KERN option in the Makefile. The best place to start is ompGemm, but it is worth looking at the other options too.
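For reference, the kernel choice is made at build time. A hedged example of what that looks like (the variable name MULT_KERN is from the issue; the file name and layout here are assumptions about the build system, not verified against the repository):

```makefile
# In the build configuration (e.g. a system-specific .make file included by
# the Makefile), pick the multiply kernel variant to compile in.
# Other values select the alternative (non-ompGemm) kernel implementations.
MULT_KERN = ompGemm
```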

A good test case is:

tkoskela commented 8 months ago

Things to look out for

tkoskela commented 8 months ago

An initial profiling result with the test case described above. Using current develop branch with MULT_KERN = ompGemm

[profiling screenshot attached to the original comment]

My first approach to reducing the inefficiency would be to move the threading up to the main loop:

https://github.com/OrderN/CONQUEST-release/blob/6bf8f4a8c20fd4fa8f1c7baeb8a6b1f23a6d2408/src/multiply_module.f90#L227

And wrap the MPI communications in !$omp master regions (which requires at least MPI_THREAD_FUNNELED from MPI_Init_thread), or in !$omp critical regions if we request MPI_THREAD_SERIALIZED instead.
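A minimal sketch of that pattern, assuming MPI_THREAD_FUNNELED so that only the master thread communicates. All routine and variable names here are hypothetical placeholders, not the actual multiply_module code:

```fortran
! Sketch only: illustrative names, not the real CONQUEST routines.
!$omp parallel default(shared) private(k)
do kpart = 1, nparts                ! main loop over partitions
   !$omp master
   call fetch_remote_data(kpart)    ! MPI communication: master thread only
   !$omp end master
   !$omp barrier                    ! all threads wait for the data to arrive
   !$omp do schedule(dynamic)
   do k = 1, nblocks(kpart)
      call do_block_multiply(kpart, k)  ! threaded kernel work
   end do
   !$omp end do                     ! implicit barrier before the next comms
end do
!$omp end parallel
```

The barriers here are the cost of this simple version: every thread idles while the master communicates, which is what the later comments try to remove.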

tkoskela commented 8 months ago

It should be possible to declare the parallel region in
https://github.com/OrderN/CONQUEST-release/blob/6bf8f4a8c20fd4fa8f1c7baeb8a6b1f23a6d2408/src/multiply_module.f90#L226 and keep the !$omp do worksharing constructs as orphaned constructs where they are, in the multiply kernel.

https://stackoverflow.com/questions/35347944/fortran-openmp-with-subroutines-and-functions/35361665#35361665
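The pattern from that answer, sketched with hypothetical routine names (not the real CONQUEST interfaces): the !$omp parallel region is opened once in the caller, while the orphaned !$omp do inside the callee binds to it, so the threads are spawned only once:

```fortran
subroutine driver(n, a)
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n)
  !$omp parallel             ! parallel region opened once, in the caller
  call kernel(n, a)          ! worksharing happens inside the callee
  !$omp end parallel
end subroutine driver

subroutine kernel(n, a)
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n)
  integer :: i
  !$omp do                   ! orphaned construct: binds to the enclosing
  do i = 1, n                ! parallel region of whichever caller is active
     a(i) = 2.0d0 * a(i)
  end do
  !$omp end do
end subroutine kernel
```

If kernel is called from outside any parallel region, the orphaned !$omp do simply runs sequentially, so the kernel stays correct either way.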

We have tried to implement this in the tk-optimise-multiply branch.

tkoskela commented 7 months ago

Conclusions

Performance of multiply kernels

Reducing OMP overhead

Longer matrix range

Next steps

Next we need to get rid of the OMP barriers by overlapping communication with computation. This is addressed in #265
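One common shape for that overlap, sketched under the same MPI_THREAD_FUNNELED assumption and with hypothetical routine names (the actual design belongs to #265): the master thread posts non-blocking receives for the next partition while all threads compute on the current one, so the remaining synchronisation only waits for communication that has already been in flight:

```fortran
! Sketch only: post_irecv/wait_irecv stand in for MPI_Irecv/MPI_Waitall
! bookkeeping on the master thread; names are illustrative.
!$omp parallel default(shared) private(k)
do kpart = 1, nparts
   !$omp master
   if (kpart < nparts) call post_irecv(kpart + 1)   ! prefetch next partition
   !$omp end master
   !$omp do schedule(dynamic)
   do k = 1, nblocks(kpart)
      call do_block_multiply(kpart, k)   ! compute overlaps the prefetch
   end do
   !$omp end do
   !$omp master
   if (kpart < nparts) call wait_irecv(kpart + 1)   ! complete the transfer
   !$omp end master
   !$omp barrier                         ! data for kpart+1 is now ready
end do
!$omp end parallel
```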