Investigate performance of other multiply kernels

OrderN / CONQUEST-release

Full public release of large scale and linear scaling DFT code CONQUEST

MIT License


Run times on young with 8 MPI ranks and 4 OpenMP threads per rank, using the inputs from matrix_multiply in https://github.com/OrderN/CONQUEST-release/pull/262. These are single runs at this point, so they might contain some variation.

Best case is ompDoik, which gives a 2x speedup over the default run (140.508 s down to 69.586 s) with 4 threads.

ompTsk segfaulted and didn't produce a run time. Based on the comments in the previous eCSE report, it did not seem worth debugging further at this point.

$ tail -n 1 */Conquest_out
==> default/Conquest_out <==
    Total run time was:             140.508 seconds

==> gemm/Conquest_out <==
    Total run time was:             107.899 seconds

==> ompDoii/Conquest_out <==
    Total run time was:              82.969 seconds

==> ompDoik/Conquest_out <==
    Total run time was:              69.586 seconds

==> ompDoji/Conquest_out <==
    Total run time was:              85.187 seconds

==> ompDojk/Conquest_out <==
    Total run time was:              72.171 seconds

==> ompGemm/Conquest_out <==
    Total run time was:              74.297 seconds

==> ompGemm_m/Conquest_out <==
    Total run time was:              69.116 seconds

==> ompTsk/Conquest_out <==
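To make the comparison easier to read, the run times above can be turned into speedups relative to the default run. The following is a minimal sketch in Python, not part of CONQUEST: the function names `parse_run_times` and `speedups` are hypothetical, and the timings are simply copied from the `tail` output above. It assumes the `default` directory is the right baseline, and it treats a kernel with no reported time (ompTsk) as missing rather than as zero.

```python
import re

# Output of `tail -n 1 */Conquest_out`, copied from the run times above.
TAIL_OUTPUT = """\
==> default/Conquest_out <==
    Total run time was:             140.508 seconds
==> gemm/Conquest_out <==
    Total run time was:             107.899 seconds
==> ompDoii/Conquest_out <==
    Total run time was:              82.969 seconds
==> ompDoik/Conquest_out <==
    Total run time was:              69.586 seconds
==> ompDoji/Conquest_out <==
    Total run time was:              85.187 seconds
==> ompDojk/Conquest_out <==
    Total run time was:              72.171 seconds
==> ompGemm/Conquest_out <==
    Total run time was:              74.297 seconds
==> ompGemm_m/Conquest_out <==
    Total run time was:              69.116 seconds
==> ompTsk/Conquest_out <==
"""


def parse_run_times(text):
    """Map each kernel directory to its run time in seconds (None if absent)."""
    times = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"==> (.+)/Conquest_out <==", line.strip())
        if header:
            current = header.group(1)
            times[current] = None  # stays None if no timing line follows
            continue
        timing = re.search(r"Total run time was:\s*([0-9.]+) seconds", line)
        if timing and current is not None:
            times[current] = float(timing.group(1))
    return times


def speedups(times, baseline="default"):
    """Speedup of each timed kernel relative to the baseline run."""
    base = times[baseline]
    return {name: base / t for name, t in times.items() if t is not None}


if __name__ == "__main__":
    times = parse_run_times(TAIL_OUTPUT)
    ranked = sorted(speedups(times).items(), key=lambda kv: kv[1], reverse=True)
    for name, factor in ranked:
        print(f"{name:10s} {times[name]:8.3f} s  {factor:.2f}x")
    failed = [name for name, t in times.items() if t is None]
    print("no run time reported:", ", ".join(failed))
```

Since these are single runs, differences of a few percent between kernels (for example between ompDoik and ompGemm_m) may be within run-to-run variation.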

Investigate performance of other multiply kernels #268