Closed dmalhotra closed 2 years ago
Thanks @dmalhotra for implementing this, and thanks @lu1and10 for your timing results. It is now merged into the repo.
The gradient kernels are slightly slower in the new implementation, but they become faster for nd > 5.
A nice feature of the new implementation is that it cleanly implements all the different variants (cp, cg, dp, dg, cdp, and cdg) in a single function, which is easier to maintain. I haven't implemented the Hessian kernels, but it should be easy to add those. Also, the new implementation lets you set the accuracy, which could give a further speedup at lower accuracies.
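To illustrate the "single function for all variants" idea, here is a minimal, unvectorized Python sketch. It is not the merged implementation: the function name `direct_eval`, its arguments, and the choice of the Laplace 1/r kernel are all assumptions made for illustration, as is the reading of the variant names (c = charges, d = dipoles, p = potential, g = potential plus gradient).

```python
import math

def direct_eval(sources, targets, charges=None, dipoles=None, want_grad=False):
    """Hypothetical reference kernel (assumed Laplace 1/r, no 1/(4*pi) scaling).

    sources, targets: lists of (x, y, z) tuples.
    charges: nd lists of ns floats, or None.
    dipoles: nd lists of ns (vx, vy, vz) tuples, or None.
    Returns (pot, grad); grad is None unless want_grad is True.
    """
    if charges is None and dipoles is None:
        raise ValueError("need charges, dipoles, or both")
    nd = len(charges) if charges is not None else len(dipoles)
    nt = len(targets)
    pot = [[0.0] * nt for _ in range(nd)]
    grad = [[[0.0] * 3 for _ in range(nt)] for _ in range(nd)] if want_grad else None

    for j, t in enumerate(targets):
        for i, s in enumerate(sources):
            dx = [t[k] - s[k] for k in range(3)]
            r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2]
            if r2 == 0.0:
                continue  # skip coincident source/target pairs
            rinv = 1.0 / math.sqrt(r2)
            rinv3 = rinv * rinv * rinv
            rinv5 = rinv3 * rinv * rinv
            # The geometric factors above are shared by all nd densities.
            for d in range(nd):
                if charges is not None:
                    q = charges[d][i]
                    pot[d][j] += q * rinv
                    if want_grad:
                        for k in range(3):
                            grad[d][j][k] -= q * dx[k] * rinv3
                if dipoles is not None:
                    v = dipoles[d][i]
                    dot = dx[0] * v[0] + dx[1] * v[1] + dx[2] * v[2]
                    pot[d][j] += dot * rinv3
                    if want_grad:
                        for k in range(3):
                            grad[d][j][k] += v[k] * rinv3 - 3.0 * dot * dx[k] * rinv5
    return pot, grad
```

Under these assumptions, cp is `charges` only with `want_grad=False`, dg is `dipoles` only with `want_grad=True`, and cdg passes both with `want_grad=True`. The inner loop over `d` reuses the same distance factors for every density, which is the part a vectorized implementation can batch.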
Thanks @dmalhotra. I tested with several values of nd: the new implementation, with mat-vec for nd, is faster for every nd tested and more than 2x faster at nd = 64. One minor thing: compilation seems to break with gcc 7 on the FI machine. @mrachh I think it's good to switch to the new implementation.
1 thread, AVX-512, ns = nt = 1000 (times in seconds)

| nd | Unvectorized | Vectorized new | Vectorized old |
|----|--------------|----------------|----------------|
| 1  | 9.7706       | 0.3578         | 0.4157         |
| 2  | 9.9316       | 0.4116         | 0.4357         |
| 4  | 9.9445       | 0.4603         | 0.4974         |
| 8  | 10.2272      | 0.6491         | 0.6836         |
| 16 | 10.9647      | 0.7871         | 0.9899         |
| 32 | 12.2969      | 1.0471         | 2.0475         |
| 64 | 13.9630      | 1.6150         | 4.0064         |
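The speedups implied by these timings can be computed directly. The snippet below is a small sketch that uses only the numbers reported above; the names `timings` and `speedups` are introduced here for illustration.

```python
# Timings in seconds, copied from the results above:
# nd -> (unvectorized, vectorized new, vectorized old)
timings = {
    1: (9.7706, 0.3578, 0.4157),
    2: (9.9316, 0.4116, 0.4357),
    4: (9.9445, 0.4603, 0.4974),
    8: (10.2272, 0.6491, 0.6836),
    16: (10.9647, 0.7871, 0.9899),
    32: (12.2969, 1.0471, 2.0475),
    64: (13.9630, 1.6150, 4.0064),
}

def speedups(timings):
    """Return nd -> (new vs unvectorized, new vs old vectorized) speedup ratios."""
    return {
        nd: (unvec / new, old / new)
        for nd, (unvec, new, old) in timings.items()
    }

if __name__ == "__main__":
    for nd, (vs_unvec, vs_old) in speedups(timings).items():
        print(f"nd = {nd:2d}: {vs_unvec:5.1f}x vs unvectorized, "
              f"{vs_old:4.2f}x vs old vectorized")
```

The ratio against the old vectorized code grows with nd and is about 2.5x at nd = 64, consistent with the "more than 2x" figure quoted above.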