Update SCTL and add new implementation of vec-kernels

lu1and10 commented 2 years ago

Thanks @dmalhotra. I tested with several values of nd: the new implementation, which uses mat-vec across the nd densities, is faster for every nd, with more than a 2x improvement at nd = 64. One minor thing: compilation seems to break with gcc 7 on the FI machine. @mrachh I think it's good to switch to the new implementation.

1 thread, AVX-512, ns = nt = 1000

| nd | Unvectorized (s) | Vectorized new (s) | Vectorized old (s) |
|---:|-----------------:|-------------------:|-------------------:|
|  1 |           9.7706 |             0.3578 |             0.4157 |
|  2 |           9.9316 |             0.4116 |             0.4357 |
|  4 |           9.9445 |             0.4603 |             0.4974 |
|  8 |          10.2272 |             0.6491 |             0.6836 |
| 16 |          10.9647 |             0.7871 |             0.9899 |
| 32 |          12.2969 |             1.0471 |             2.0475 |
| 64 |          13.9630 |             1.6150 |             4.0064 |
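
For anyone who wants the speedup ratios rather than raw times, here is a small sketch that derives them from the timings above. The numbers are copied from this comment. The names `timings` and `speedups` are only illustrative and are not part of FMM3D or SCTL.

```python
# Timings in seconds, copied from the results above
# (1 thread, AVX-512, ns = nt = 1000).
# Each entry: nd -> (unvectorized, vectorized new, vectorized old)
timings = {
    1: (9.7706, 0.3578, 0.4157),
    2: (9.9316, 0.4116, 0.4357),
    4: (9.9445, 0.4603, 0.4974),
    8: (10.2272, 0.6491, 0.6836),
    16: (10.9647, 0.7871, 0.9899),
    32: (12.2969, 1.0471, 2.0475),
    64: (13.9630, 1.6150, 4.0064),
}


def speedups(unvec, new, old):
    """Return (speedup of new over unvectorized, speedup of new over old)."""
    return unvec / new, old / new


for nd, (unvec, new, old) in timings.items():
    vs_unvec, vs_old = speedups(unvec, new, old)
    print(f"nd = {nd:2d}: new vs old {vs_old:.2f}x, new vs unvectorized {vs_unvec:.1f}x")
```

With these figures, the new-over-old ratio at nd = 64 comes out to roughly 2.48x, and it is above 1 for every nd tested.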

mrachh commented 2 years ago

Thanks @dmalhotra for implementing this, and thanks @lu1and10 for your timing results. It is now merged into the repo.

dmalhotra commented 2 years ago

The gradient kernels are slightly slower with the new implementation at small nd, but they become faster for nd > 5.

A nice feature of the new implementation is that it cleanly implements all the different variants (cp, cg, dp, dg, cdp, and cdg) in a single function, which is easier to maintain. I haven't implemented the Hessian kernels, but it should be easy to add those. Also, the new implementation allows you to set the accuracy, which could give a further speedup at lower accuracies.
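
To illustrate the "single function for all variants" idea, here is a minimal, unvectorized Python sketch. It is not the SCTL code: the function name `laplace_direct`, its arguments, and the sign convention are assumptions for illustration. It assumes the potential is u(x) = sum_j c_j / |x - y_j| + v_j . (x - y_j) / |x - y_j|^3, handles a single density (nd = 1), and selects the variant by which inputs are supplied and whether the gradient is requested.

```python
import math


def laplace_direct(sources, targets, charges=None, dipoles=None, want_grad=False):
    """Hypothetical direct-sum Laplace kernel covering cp, cg, dp, dg, cdp, cdg.

    charges only   -> cp / cg
    dipoles only   -> dp / dg
    both           -> cdp / cdg
    want_grad=True -> the *g variants (potential and gradient)
    """
    pot = [0.0] * len(targets)
    grad = [[0.0, 0.0, 0.0] for _ in targets] if want_grad else None
    for i, x in enumerate(targets):
        for j, y in enumerate(sources):
            d = [x[k] - y[k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if r2 == 0.0:
                continue  # skip coincident source/target pairs
            rinv = 1.0 / math.sqrt(r2)
            rinv3 = rinv * rinv * rinv
            if charges is not None:
                q = charges[j]
                pot[i] += q * rinv
                if want_grad:
                    # grad of q / r is -q * d / r^3
                    for k in range(3):
                        grad[i][k] -= q * d[k] * rinv3
            if dipoles is not None:
                v = dipoles[j]
                vd = v[0] * d[0] + v[1] * d[1] + v[2] * d[2]
                pot[i] += vd * rinv3
                if want_grad:
                    # grad of (v . d) / r^3 is v / r^3 - 3 (v . d) d / r^5
                    rinv5 = rinv3 * rinv * rinv
                    for k in range(3):
                        grad[i][k] += v[k] * rinv3 - 3.0 * vd * d[k] * rinv5
    return pot, grad


# Example: unit charge plus x-directed unit dipole at the origin ("cdg" variant).
pot, grad = laplace_direct(
    sources=[(0.0, 0.0, 0.0)],
    targets=[(2.0, 0.0, 0.0)],
    charges=[1.0],
    dipoles=[(1.0, 0.0, 0.0)],
    want_grad=True,
)
```

The point of the sketch is only that the six variants share one loop and differ in a few guarded terms, so there is one code path to maintain instead of six.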

flatironinstitute / FMM3D

Update SCTL and add new implementation of vec-kernels #25