OrderN / CONQUEST-release

Full public release of large scale and linear scaling DFT code CONQUEST
http://www.order-n.org/
MIT License

Data alignment in m_kern_min #337

Closed tkoskela closed 2 months ago

tkoskela commented 2 months ago

In https://github.com/OrderN/CONQUEST-release/blob/64ee0cc06cc71d1799c347aa55ee323c01e66a10/src/multiply_kernel_ompGemm_m.f90#L468-L473 we copy data from the one-dimensional sparse arrays b and c into the two-dimensional temporary arrays tempb and tempc. This is where most of the time in m_kern_min is spent. Because b and c are sparse, we have to create temporary dense copies of the arrays to pass to dgemm.
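To make the copy step concrete, here is a minimal sketch in plain Python rather than the Fortran of the linked source. The packed layout, the `offsets` list and the name `gather_dense` are all assumptions made for illustration and do not reflect CONQUEST's actual indexing. It only shows the general idea of expanding a packed one-dimensional array into a zero-filled dense buffer in column-major (Fortran) order, which is the form dgemm expects.

```python
def gather_dense(packed, offsets, nrows):
    """Expand a packed 1-D array into a dense column-major buffer.

    Hypothetical layout (an assumption, not CONQUEST's actual indexing):
    offsets[j] is where column j starts in `packed`, or None if that
    column is not stored and should be treated as zeros.
    """
    ncols = len(offsets)
    dense = [0.0] * (nrows * ncols)  # zero-fill covers the absent columns
    for j, start in enumerate(offsets):
        if start is None:
            continue
        for i in range(nrows):
            # Column-major (Fortran) order: element (i, j) lives at i + j*nrows
            dense[i + j * nrows] = packed[start + i]
    return dense


packed = [1.0, 2.0, 3.0, 4.0]  # two stored columns of length 2
offsets = [0, None, 2]         # the middle column is absent
print(gather_dense(packed, offsets, nrows=2))
```

The inner loop walks the first index, so consecutive writes land in adjacent memory locations; that is the contiguous case described for tempb.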

Writes to tempb are contiguous (we increment the first index), but writes to tempc are strided (we increment the second index). A possible improvement would be to flip the writes to tempc so that they are contiguous, and to set the second argument (transb) to 't' in the dgemm call so that dgemm works on the transpose: https://github.com/OrderN/CONQUEST-release/blob/64ee0cc06cc71d1799c347aa55ee323c01e66a10/src/multiply_kernel_ompGemm_m.f90#L481-L482
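The proposal can be sketched as follows, again in plain Python with flat lists standing in for column-major Fortran arrays. The function names are hypothetical, and `gemm` is a naive stand-in written for this example, not the BLAS routine. The sketch only checks that the two approaches give the same product: filling the buffer with strided writes and multiplying with 'n', or filling the transposed buffer contiguously and multiplying with 't'. It says nothing about which is faster.

```python
def fill_strided(values, nrows, ncols):
    """Fill an nrows x ncols column-major buffer while the second index
    varies fastest, so each write jumps by nrows elements."""
    buf = [0.0] * (nrows * ncols)
    k = 0
    for i in range(nrows):
        for j in range(ncols):
            buf[i + j * nrows] = values[k]  # stride of nrows between writes
            k += 1
    return buf


def fill_contiguous(values, nrows, ncols):
    """Store the transpose (ncols x nrows, column-major) instead, so that
    consecutive writes land in adjacent locations."""
    buf = [0.0] * (nrows * ncols)
    k = 0
    for i in range(nrows):
        for j in range(ncols):
            buf[j + i * ncols] = values[k]  # index equals k: unit stride
            k += 1
    return buf


def gemm(transb, m, n, k, a, b):
    """Naive column-major C = A * op(B), where op(B) is B for 'n' and the
    transpose of B for 't'. A is m x k; op(B) is k x n."""
    c = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            s = 0.0
            for p in range(k):
                # With 't', B is stored as n x k, so op(B)(p, j) = B(j, p)
                b_pj = b[j + p * n] if transb == 't' else b[p + j * k]
                s += a[i + p * m] * b_pj
            c[i + j * m] = s
    return c


a = [1.0, 2.0, 3.0, 4.0]                 # 2 x 2 matrix, column-major
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # 2 x 3 matrix, visited row by row
c_strided = gemm('n', 2, 3, 2, a, fill_strided(values, 2, 3))
c_flipped = gemm('t', 2, 3, 2, a, fill_contiguous(values, 2, 3))
print(c_strided == c_flipped)
```

The contiguous fill moves the cost of the awkward access pattern from the copy loop into the multiply, so any gain depends on how the dgemm implementation handles the transposed operand.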

tkoskela commented 2 months ago

I tested this on Myriad with the matrix_multiply benchmark (https://github.com/OrderN/CONQUEST-release/tree/develop/benchmarks/matrix_multiply), using the following configuration:

    The calculation will be performed on     8 processes
    The calculation will be performed on     4 threads

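The results suggest that the original non-contiguous writes perform better than contiguous writes combined with a transpose in dgemm: the total run time roughly doubled, from 69.872 to 135.430 seconds.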

Original code with strided write to c

Total run time was: 69.872 seconds

Contiguous write to c, transpose in dgemm

Total run time was: 135.430 seconds