tkoskela closed 7 months ago
UPDATE: Because of a bug in my script, I had neglected the `ompGemm_m` kernel, which is optimized to allocate temporary arrays once before the main loop instead of deallocating and reallocating them on each loop iteration. Its performance is comparable to that of the `ompDoik` kernel.
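To illustrate the difference described above, here is a minimal sketch of the two allocation patterns. It is not the CONQUEST Fortran code: the function names, the flat row-major block layout, and the tiny pure-Python `gemm_accumulate` helper are all assumptions made for illustration. In Python the timing difference is not representative; the point is only where the temporary is allocated.

```python
def gemm_accumulate(a, b, c, n):
    """c += a @ b for n x n blocks stored as flat row-major lists (hypothetical helper)."""
    for i in range(n):
        for k in range(n):
            a_ik = a[i * n + k]
            for j in range(n):
                c[i * n + j] += a_ik * b[k * n + j]


def multiply_realloc(blocks_a, blocks_b, n):
    """Pattern 1: a fresh temporary is allocated on every loop iteration."""
    sums = []
    for a, b in zip(blocks_a, blocks_b):
        tmp = [0.0] * (n * n)  # allocated (and later freed) each iteration
        gemm_accumulate(a, b, tmp, n)
        sums.append(sum(tmp))
    return sums


def multiply_hoisted(blocks_a, blocks_b, n):
    """Pattern 2: the temporary is allocated once, before the main loop."""
    tmp = [0.0] * (n * n)  # single allocation, reused below
    sums = []
    for a, b in zip(blocks_a, blocks_b):
        for idx in range(n * n):  # reset the reused buffer instead of reallocating
            tmp[idx] = 0.0
        gemm_accumulate(a, b, tmp, n)
        sums.append(sum(tmp))
    return sums


# Two pairs of 2 x 2 blocks; the first left-hand block is the identity.
identity = [1.0, 0.0, 0.0, 1.0]
block = [1.0, 2.0, 3.0, 4.0]
blocks_a = [identity, block]
blocks_b = [block, block]

realloc_result = multiply_realloc(blocks_a, blocks_b, 2)
hoisted_result = multiply_hoisted(blocks_a, blocks_b, 2)
print(realloc_result == hoisted_result)
```

Both functions produce the same results; the hoisted version simply trades a per-iteration allocation for a reset of the reused buffer, which is the trade-off the `ompGemm_m` description refers to.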
==> default/Conquest_out <==
Total run time was: 140.508 seconds
==> gemm/Conquest_out <==
Total run time was: 107.899 seconds
==> ompDoii/Conquest_out <==
Total run time was: 82.969 seconds
==> ompDoik/Conquest_out <==
Total run time was: 69.586 seconds
==> ompDoji/Conquest_out <==
Total run time was: 85.187 seconds
==> ompDojk/Conquest_out <==
Total run time was: 72.171 seconds
==> ompGemm/Conquest_out <==
Total run time was: 74.297 seconds
==> ompGemm_m/Conquest_out <==
Total run time was: 69.116 seconds
==> ompTsk/Conquest_out <==
Run times on young, with 8 MPI ranks and 4 OpenMP threads per rank, using inputs from `matrix_multiply` in https://github.com/OrderN/CONQUEST-release/pull/262. These are just single runs at this point, so they might contain some variation.

Best case is `ompDoik`, which gives a 2x speedup with 4 threads. `ompTsk` segfaulted and didn't produce a run time. Based on the comments in the previous eCSE report, it did not seem worth debugging further at this point.
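For reference, the speedups relative to the `default` kernel can be recomputed from the listing above with a short script. This is a sketch, not the script used to produce the results: the function names are hypothetical, and it assumes the listing has the `==> <kernel>/Conquest_out <==` header format shown above, with a missing "Total run time" line meaning the run did not finish.

```python
import re

# Run-time listing copied from the results above.
LISTING = """\
==> default/Conquest_out <==
Total run time was: 140.508 seconds
==> gemm/Conquest_out <==
Total run time was: 107.899 seconds
==> ompDoii/Conquest_out <==
Total run time was: 82.969 seconds
==> ompDoik/Conquest_out <==
Total run time was: 69.586 seconds
==> ompDoji/Conquest_out <==
Total run time was: 85.187 seconds
==> ompDojk/Conquest_out <==
Total run time was: 72.171 seconds
==> ompGemm/Conquest_out <==
Total run time was: 74.297 seconds
==> ompGemm_m/Conquest_out <==
Total run time was: 69.116 seconds
==> ompTsk/Conquest_out <==
"""


def parse_run_times(text):
    """Map kernel name -> run time in seconds (None if no time was reported)."""
    times = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"==> (\S+)/Conquest_out <==", line.strip())
        if header:
            current = header.group(1)
            times[current] = None  # stays None if no run-time line follows
            continue
        timing = re.match(r"Total run time was:\s*([\d.]+) seconds", line.strip())
        if timing and current is not None:
            times[current] = float(timing.group(1))
    return times


def speedups(times, baseline="default"):
    """Speedup of each kernel that reported a time, relative to the baseline."""
    base = times[baseline]
    return {kernel: base / t for kernel, t in times.items() if t is not None}


run_times = parse_run_times(LISTING)
speedup = speedups(run_times)

for kernel, t in run_times.items():
    if t is None:
        print(f"{kernel:10s} no run time reported")
    else:
        print(f"{kernel:10s} {t:8.3f} s  {speedup[kernel]:.2f}x")
```

Recording a missing run time as `None` instead of skipping the kernel keeps failed runs such as `ompTsk` visible in the summary.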