Open hokru opened 4 years ago
This is something that I noticed as well. I never quite understood why the block transpose was sometimes bad, but I never dug into it. The best thing to do would be to transpose as we come out of L1 cache to prevent two L1-DRAM round trips.
Sorry for the non-answer; it's an open question for me as well.
Happy enough with that answer ;-). At least I am not imagining things.
I did put the functions into a simple C++ program. Probably terrible style. https://gist.github.com/hokru/3f16adf5505f49df95ceee024f75b200
Maybe the matrices need to be much larger to get the benefits from blocking.
I've been running various DFT calculations with PSI4 inside Intel's VTune, and gg_fast_transpose popped up as a top hotspot. As a test I exchanged it for gg_naive_transpose and saw a significant speedup (50% for a C60 test) for that function. It's only 4-5% of the total CPU time for single points, so no real bottleneck to worry about, but I'm wondering why the blocked transpose might be so much slower.