I would like to have optimized FFT algorithms for `gr` vectors. Wishlist:

- Schönhage-Strassen type FFT and convolution for arbitrary rings without roots of unity
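To make this item concrete: the Schönhage-Strassen trick adjoins a synthetic root of unity, since R = Z/nZ may have none. In S = R[y]/(y^m + 1) the element y satisfies y^m = -1 and is a principal 2m-th root of unity, so a length-2m FFT over S needs only additions and sign-flipping rotations. Below is a minimal one-level Python sketch (the modulus n = 243 and size m = 4 are arbitrary example choices, and a real implementation would recurse inside `smul` and pick parameters to balance the recursion):

```python
# One level of the Schonhage-Strassen trick: R = Z/nZ need not contain
# roots of unity, but in S = R[y]/(y^m + 1) the element y is a principal
# 2m-th root of unity (y^m = -1), so a length-2m FFT over S only needs
# cheap "negacyclic shifts" instead of genuine root-of-unity multiplies.
n = 243          # arbitrary odd modulus (example choice, no structure assumed)
m = 4            # S = (Z/nZ)[y] / (y^m + 1), elements stored as length-m lists
K = 2 * m        # transform length supported by the synthetic root y

def sadd(a, b): return [(x + y) % n for x, y in zip(a, b)]
def ssub(a, b): return [(x - y) % n for x, y in zip(a, b)]

def yshift(a, j):
    """Multiply a in S by y^j: a rotation with sign flips, no ring mults."""
    j %= 2 * m
    sign = 1
    if j >= m:
        j -= m
        sign = -1
    out = [0] * m
    for i in range(m):
        if i + j < m:
            out[i + j] = sign * a[i] % n
        else:
            out[i + j - m] = -sign * a[i] % n
    return out

def smul(a, b):
    """Schoolbook negacyclic product in S (the real algorithm recurses here)."""
    out = [0] * m
    for i in range(m):
        for j in range(m):
            if i + j < m:
                out[i + j] = (out[i + j] + a[i] * b[j]) % n
            else:
                out[i + j - m] = (out[i + j - m] - a[i] * b[j]) % n
    return out

def fft(v, wexp):
    """Radix-2 FFT over S using the root y^wexp (of order len(v))."""
    L = len(v)
    if L == 1:
        return list(v)
    even = fft(v[0::2], 2 * wexp)
    odd = fft(v[1::2], 2 * wexp)
    out = [None] * L
    for k in range(L // 2):
        t = yshift(odd[k], wexp * k)
        out[k] = sadd(even[k], t)
        out[k + L // 2] = ssub(even[k], t)
    return out

def cyclic_convolution(a, b):
    """Length-K cyclic convolution of vectors over S via the FFT in S."""
    fa, fb = fft(a, 1), fft(b, 1)
    fc = [smul(x, y) for x, y in zip(fa, fb)]
    c = fft(fc, 2 * m - 1)              # inverse FFT: y^(2m-1) = y^(-1)
    kinv = pow(K, -1, n)                # needs gcd(2m, n) = 1
    return [[x * kinv % n for x in s] for s in c]
```

All twiddle multiplications reduce to `yshift`, which is the whole point: the expensive generic multiplications are confined to the K pointwise `smul` calls.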
A good start could be to translate `acb_dft` to the `gr` format (after which `acb_dft` could be converted to a thin wrapper around the `gr` version). This doesn't have all the bells and whistles listed above, but the architecture looks reasonable as a basis for adding more features.
On the other hand, here is what `fft_small` does:

- Arithmetic in Z/nZ using `double` representation
- Truncated, bit-reversed FFTs
- Vectorized radix-4 butterflies
- Macro-expanded basecases up to length 256
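The `double` representation in the first item amounts to reducing products with a precomputed floating-point inverse of the modulus. A minimal Python sketch of the idea, restricted for simplicity to moduli below 2^26 so that products are exact in a `double` (the actual code supports much larger moduli via FMA-based error correction):

```python
def mulmod_double(a, b, n, ninv):
    # Compute a*b mod n entirely in double arithmetic.
    # Assumes n < 2**26 so the product a*b < 2**52 is exactly representable.
    h = float(a) * float(b)
    q = float(round(h * ninv))   # rounded quotient estimate
    r = h - q * n                # exact here, and lies in (-n, n)
    if r < 0.0:
        r += n
    elif r >= n:
        r -= n
    return int(r)

n = 40961                # example small modulus (arbitrary choice)
ninv = 1.0 / n
r = mulmod_double(12345, 6789, n, ninv)   # equals (12345 * 6789) % n
```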
However, it is notably missing multithreading within a single transform (multiplications are parallelized only by assigning separate FFTs to separate threads), and it is not clear to me from reading the code whether it uses the matrix trick for huge transforms.
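For reference, the matrix trick (the "four-step" or Bailey FFT) splits a length-n1*n2 transform into n1 inner transforms, a twiddle pass, and n2 outer transforms, which improves locality for huge sizes. A hedged Python sketch over the complex numbers, using the quadratic DFT for the sub-transforms purely for clarity:

```python
import cmath

def dft(v):
    # direct quadratic DFT, used as reference and as the sub-transform
    N = len(v)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(v[j] * w ** (j * k) for j in range(N)) for k in range(N)]

def four_step_fft(x, n1, n2):
    # Write input index j = j1 + n1*j2 and output index k = k2 + n2*k1.
    N = n1 * n2
    w = cmath.exp(-2j * cmath.pi / N)
    # Step 1: n1 inner transforms of length n2 over the stride-n1 columns.
    rows = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]
    # Step 2: multiply by the twiddle factors w^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= w ** (j1 * k2)
    # Step 3 (with the implicit transpose): n2 outer transforms of length n1.
    X = [0j] * N
    for k2 in range(n2):
        col = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            X[k2 + n2 * k1] = col[k1]
    return X
```

In a cache-oblivious or multithreaded implementation, each of the n1 + n2 sub-transforms fits in cache and can be dispatched independently.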
Complex as the `fft_small` code is, it might be easier to reverse-engineer it into a more general `gr` form than to reimplement all the same optimizations from scratch for `gr` data.
Ideally, a completely generic FFT on top of a `dmod` ring with vector butterfly overloads could be made efficient enough to subsume `fft_small` (and even improve it further). I don't know if this is the case, but it would be interesting to try. Even if `fft_small` can't be matched, it would be nice to have fallback FFT code that works for full 64-bit moduli and without AVX2 support, even if it's a good factor slower than Dan's code.
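As an illustration of the layering meant here (a hypothetical interface, not FLINT's actual `gr`/`dmod` API): the transform is written once against a small context object, and an optimized context can override the whole butterfly pass, e.g. with vectorized code, without touching the generic driver. A Python sketch over Z/257Z:

```python
class ModCtx:
    """Minimal ring context. A specialized 'dmod'-style context could
    subclass this and override butterfly_pass with a vectorized version,
    leaving the generic fft driver untouched."""

    def __init__(self, n):
        self.n = n

    def butterfly_pass(self, even, odd, w):
        # Map (e, o) -> (e + w^k * o, e - w^k * o) for k = 0, 1, ...
        n = self.n
        t = 1
        hi, lo = [], []
        for e, o in zip(even, odd):
            u = t * o % n
            hi.append((e + u) % n)
            lo.append((e - u) % n)
            t = t * w % n
        return hi, lo

    def fft(self, v, w):
        # Recursive radix-2 FFT; w must be a root of unity of order len(v).
        L = len(v)
        if L == 1:
            return list(v)
        w2 = w * w % self.n
        even = self.fft(v[0::2], w2)
        odd = self.fft(v[1::2], w2)
        hi, lo = self.butterfly_pass(even, odd, w)
        return hi + lo

ctx = ModCtx(257)                         # example prime modulus
X = ctx.fft([1, 2, 3, 4, 5, 6, 7, 8], 64)  # 64 has order 8 mod 257
```

The design question is whether the per-pass virtual dispatch can be made cheap enough (in C: a function pointer per pass, not per butterfly) that the generic driver loses nothing against hand-specialized code.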
Anyway, even something pretty naive would work well enough for arbitrary-precision rings.