Open hokru opened 4 years ago
This is something that I noticed as well. I never quite understood why the block transpose was sometimes bad, but I never dug into it. The best thing to do would be to transpose as we come out of L1 cache to prevent two L1-DRAM round trips.
Sorry for the non-answer; it's an open question for me as well.
Happy enough with that answer ;-). At least I am not imagining things.
I did put the functions into a simple C++ program. Probably terrible style. https://gist.github.com/hokru/3f16adf5505f49df95ceee024f75b200
Maybe the matrices need to be much larger to get the benefits from blocking.
I've been running various DFT calculations with PSI4 inside Intel's VTune, and gg_fast_transpose popped up as a top hotspot. As a test I exchanged it for gg_naive_transpose and saw a significant speedup (50% for a C60 test) for that function. It's only 4-5% of the total CPU time for single points, so no real bottleneck to worry about, but I'm wondering why the blocked transpose might be so much slower.