tkoskela closed this 2 months ago
I tested this on myriad with https://github.com/OrderN/CONQUEST-release/tree/develop/benchmarks/matrix_multiply with
```
The calculation will be performed on 8 processes
The calculation will be performed on 4 threads
```
The results suggest non-contiguous writes perform better than transposing the final array in `dgemm`.

`c`:

```
Total run time was: 69.872 seconds
```

`c`, transpose in dgemm:

```
Total run time was: 135.430 seconds
```
In https://github.com/OrderN/CONQUEST-release/blob/64ee0cc06cc71d1799c347aa55ee323c01e66a10/src/multiply_kernel_ompGemm_m.f90#L468-L473 we copy data from the one-dimensional sparse arrays `b` and `c` into the two-dimensional temporary arrays `tempb` and `tempc`. This is where most of the time in `m_kern_min` is spent. Because of the sparsity in `b` and `c`, we have to create temporary (dense) copies of the arrays to pass to `dgemm`.
Writes to `tempb` are contiguous (we are incrementing the first index), but writes to `tempc` are strided (we increment the second index). A possible improvement would be to flip the writes to `tempc` to be contiguous, and to set the second argument to `t` in the `dgemm` call: https://github.com/OrderN/CONQUEST-release/blob/64ee0cc06cc71d1799c347aa55ee323c01e66a10/src/multiply_kernel_ompGemm_m.f90#L481-L482