dlfivefifty opened this issue 1 year ago (status: Open)
This would certainly be helpful to include in the interface, but I'm not sure that execution would end up being any faster than cheb2leg(cheb2leg(X')'), apart from using plans and a temporary array for the transpose.
The current plans are already parallelized with OpenMP, which may explain why multi-column execution is faster than expected (the multi-column timing below is slightly less than half of the 2D transform):
julia> using FastTransforms, LinearAlgebra
julia> n = 10_000; x = randn(n); p = plan_leg2cheb(Float64, n); lmul!(p, ldiv!(p, x)); @time lmul!(p, x);
0.007975 seconds
julia> X = randn(n, n);
julia> lmul!(p, ldiv!(p, X)); @time lmul!(p, X);
5.521245 seconds
julia> n*0.007975/5.52 # close to 18 -- my core count
14.447463768115943
Ideally, we'd want a "cache-oblivious transpose." On a single machine, is this how FFTW does it? https://github.com/FFTW/fftw3/blob/e9c510bf92a4fa848982310d2ecc8c5701aa3f6a/kernel/transpose.c
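To make the term concrete, here is a minimal sketch of a recursive, cache-oblivious out-of-place transpose in plain Julia. It is not a description of what FFTW's transpose.c does; the name co_transpose! and the cutoff of 32 are illustrative assumptions. The idea is to keep halving the longer dimension until the block is small enough to fit in cache, without ever naming a cache size.

```julia
# Recursive (cache-oblivious) out-of-place transpose: fills B with transpose(A).
# `co_transpose!` is a hypothetical name; `cutoff` (assumed >= 1) is an
# illustrative base-case size, not a tuned value.
function co_transpose!(B::AbstractMatrix, A::AbstractMatrix,
                       rows = axes(A, 1), cols = axes(A, 2); cutoff::Int = 32)
    m, n = length(rows), length(cols)
    if m <= cutoff && n <= cutoff
        # Base case: the block is small, copy it element by element.
        for j in cols, i in rows
            B[j, i] = A[i, j]
        end
    elseif m >= n
        # Split the longer dimension (rows) in half and recurse.
        mid = first(rows) + m ÷ 2
        co_transpose!(B, A, first(rows):mid-1, cols; cutoff = cutoff)
        co_transpose!(B, A, mid:last(rows), cols; cutoff = cutoff)
    else
        # Split the longer dimension (columns) in half and recurse.
        mid = first(cols) + n ÷ 2
        co_transpose!(B, A, rows, first(cols):mid-1; cutoff = cutoff)
        co_transpose!(B, A, rows, mid:last(cols); cutoff = cutoff)
    end
    return B
end

A = reshape(collect(1.0:7000.0), 100, 70)
B = similar(A, 70, 100)
co_transpose!(B, A)
```

Whether this beats a plain blocked transpose in practice would have to be measured; the two recursive calls in each branch touch disjoint blocks, so they could in principle also be run on separate threads.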
FFTW allows specifying "regions" for multidimensional FFTs:
It turns out I need this feature for Legendre transforms. At the moment it only does 1D transforms:
I can add it to FastTransforms.jl, e.g., if I need a 2D Legendre transform I can do
but I'm curious if this could be SIMD-optimised (or multithreaded) in C to make it faster?
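For readers following along, the transpose-based pattern discussed in this thread (a 1D transform down the columns, a transpose, the same transform again, and a transpose back) can be sketched generically. This is only an illustration under stated assumptions: transform2d and apply_columns! are hypothetical names, f stands for any function mapping a vector to a vector of the same length, and cumsum is used as a stand-in so the example runs without FastTransforms.jl. Passing cheb2leg as f should give the same pattern as cheb2leg(cheb2leg(X')')-style code, assuming it accepts a plain vector.

```julia
using LinearAlgebra  # for transpose!

# Apply the 1D transform `f` to every column of X, writing into Y.
# (`apply_columns!` is a hypothetical helper name.)
function apply_columns!(f, Y::AbstractMatrix, X::AbstractMatrix)
    for j in axes(X, 2)
        Y[:, j] = f(X[:, j])
    end
    return Y
end

# Apply `f` along both dimensions of X using one temporary for the transpose.
# (`transform2d` is a hypothetical name, not part of FastTransforms.jl.)
function transform2d(f, X::AbstractMatrix)
    m, n = size(X)
    Y = similar(X)
    apply_columns!(f, Y, X)   # transform along dimension 1
    T = similar(X, n, m)
    transpose!(T, Y)          # original rows are now columns of T
    Z = similar(T)
    apply_columns!(f, Z, T)   # transform along the original dimension 2
    return transpose!(Y, Z)   # transpose back to m × n
end

# Stand-in 1D "transform": cumulative sum.
X = [1.0 2.0; 3.0 4.0]
transform2d(cumsum, X)        # equals cumsum(cumsum(X, dims=1), dims=2)
```

A version built on plans would preallocate T and Z once and reuse them across calls, which is the part a C implementation could also thread or vectorize.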