I would like to have optimized FFT algorithms for `gr` vectors. Wishlist:

- Schönhage-Strassen type FFT and convolution for arbitrary rings without roots of unity
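To make this item concrete: the Schönhage-Strassen trick adjoins a synthetic root of unity, since R = Z/nZ may have none. In S = R[y]/(y^m + 1) the element y satisfies y^m = -1 and is a principal 2m-th root of unity, so a length-2m FFT over S needs only additions and sign-flipping rotations. Below is a minimal one-level Python sketch (the modulus n = 243 and size m = 4 are arbitrary example choices, and a real implementation would recurse inside `smul` and pick parameters to balance the recursion):

```python
# One level of the Schonhage-Strassen trick: R = Z/nZ need not contain
# roots of unity, but in S = R[y]/(y^m + 1) the element y is a principal
# 2m-th root of unity (y^m = -1), so a length-2m FFT over S only needs
# cheap "negacyclic shifts" instead of genuine root-of-unity multiplies.
n = 243          # arbitrary odd modulus (example choice, no structure assumed)
m = 4            # S = (Z/nZ)[y] / (y^m + 1), elements stored as length-m lists
K = 2 * m        # transform length supported by the synthetic root y

def sadd(a, b): return [(x + y) % n for x, y in zip(a, b)]
def ssub(a, b): return [(x - y) % n for x, y in zip(a, b)]

def yshift(a, j):
    """Multiply a in S by y^j: a rotation with sign flips, no ring mults."""
    j %= 2 * m
    sign = 1
    if j >= m:
        j -= m
        sign = -1
    out = [0] * m
    for i in range(m):
        if i + j < m:
            out[i + j] = sign * a[i] % n
        else:
            out[i + j - m] = -sign * a[i] % n
    return out

def smul(a, b):
    """Schoolbook negacyclic product in S (the real algorithm recurses here)."""
    out = [0] * m
    for i in range(m):
        for j in range(m):
            if i + j < m:
                out[i + j] = (out[i + j] + a[i] * b[j]) % n
            else:
                out[i + j - m] = (out[i + j - m] - a[i] * b[j]) % n
    return out

def fft(v, wexp):
    """Radix-2 FFT over S using the root y^wexp (of order len(v))."""
    L = len(v)
    if L == 1:
        return list(v)
    even = fft(v[0::2], 2 * wexp)
    odd = fft(v[1::2], 2 * wexp)
    out = [None] * L
    for k in range(L // 2):
        t = yshift(odd[k], wexp * k)
        out[k] = sadd(even[k], t)
        out[k + L // 2] = ssub(even[k], t)
    return out

def cyclic_convolution(a, b):
    """Length-K cyclic convolution of vectors over S via the FFT in S."""
    fa, fb = fft(a, 1), fft(b, 1)
    fc = [smul(x, y) for x, y in zip(fa, fb)]
    c = fft(fc, 2 * m - 1)              # inverse FFT: y^(2m-1) = y^(-1)
    kinv = pow(K, -1, n)                # needs gcd(2m, n) = 1
    return [[x * kinv % n for x in s] for s in c]
```

All twiddle multiplications reduce to `yshift`, which is the whole point: the expensive generic multiplications are confined to the K pointwise `smul` calls.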
A good start could be to translate `acb_dft` to the `gr` format (after which `acb_dft` could be converted to a thin wrapper around the `gr` version). This doesn't have all the bells and whistles listed above, but the architecture looks reasonable as a basis for adding more features.
On the other hand, here is what `fft_small` does:

- Arithmetic in Z/nZ using `double` representation
- Truncated, bit-reversed FFTs
- Vectorized radix-4 butterflies
- Macro-expanded basecases up to length 256
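The `double` representation in the first item amounts to reducing products with a precomputed floating-point inverse of the modulus. A minimal Python sketch of the idea, restricted for simplicity to moduli below 2^26 so that products are exact in a `double` (the actual code supports much larger moduli via FMA-based error correction):

```python
def mulmod_double(a, b, n, ninv):
    # Compute a*b mod n entirely in double arithmetic.
    # Assumes n < 2**26 so the product a*b < 2**52 is exactly representable.
    h = float(a) * float(b)
    q = float(round(h * ninv))   # rounded quotient estimate
    r = h - q * n                # exact here, and lies in (-n, n)
    if r < 0.0:
        r += n
    elif r >= n:
        r -= n
    return int(r)

n = 40961                # example small modulus (arbitrary choice)
ninv = 1.0 / n
r = mulmod_double(12345, 6789, n, ninv)   # equals (12345 * 6789) % n
```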
However, it is notably missing multithreading within a single transform (multiplications are parallelized only by assigning separate FFTs to separate threads), and it is not clear to me from reading the code whether it uses the matrix trick for huge transforms.
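For reference, the matrix trick (the "four-step" or Bailey FFT) splits a length-n1*n2 transform into n1 inner transforms, a twiddle pass, and n2 outer transforms, which improves locality for huge sizes. A hedged Python sketch over the complex numbers, using the quadratic DFT for the sub-transforms purely for clarity:

```python
import cmath

def dft(v):
    # direct quadratic DFT, used as reference and as the sub-transform
    N = len(v)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(v[j] * w ** (j * k) for j in range(N)) for k in range(N)]

def four_step_fft(x, n1, n2):
    # Write input index j = j1 + n1*j2 and output index k = k2 + n2*k1.
    N = n1 * n2
    w = cmath.exp(-2j * cmath.pi / N)
    # Step 1: n1 inner transforms of length n2 over the stride-n1 columns.
    rows = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]
    # Step 2: multiply by the twiddle factors w^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= w ** (j1 * k2)
    # Step 3 (with the implicit transpose): n2 outer transforms of length n1.
    X = [0j] * N
    for k2 in range(n2):
        col = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            X[k2 + n2 * k1] = col[k1]
    return X
```

In a cache-oblivious or multithreaded implementation, each of the n1 + n2 sub-transforms fits in cache and can be dispatched independently.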
Complex as the `fft_small` code is, it might be easier to reverse-engineer it into a more general `gr` form than to reimplement all the same optimizations from scratch for `gr` data.
Ideally, a completely generic FFT on top of a `dmod` ring with vector butterfly overloads could be made efficient enough to subsume `fft_small` (and even improve it further). I don't know if this is the case, but it would be interesting to try. Even if `fft_small` can't be matched, it would be nice to have fallback FFT code that works for full 64-bit moduli and without AVX2 support, even if it's a good factor slower than Dan's code.
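As an illustration of the layering meant here (a hypothetical interface, not FLINT's actual `gr`/`dmod` API): the transform is written once against a small context object, and an optimized context can override the whole butterfly pass, e.g. with vectorized code, without touching the generic driver. A Python sketch over Z/257Z:

```python
class ModCtx:
    """Minimal ring context. A specialized 'dmod'-style context could
    subclass this and override butterfly_pass with a vectorized version,
    leaving the generic fft driver untouched."""

    def __init__(self, n):
        self.n = n

    def butterfly_pass(self, even, odd, w):
        # Map (e, o) -> (e + w^k * o, e - w^k * o) for k = 0, 1, ...
        n = self.n
        t = 1
        hi, lo = [], []
        for e, o in zip(even, odd):
            u = t * o % n
            hi.append((e + u) % n)
            lo.append((e - u) % n)
            t = t * w % n
        return hi, lo

    def fft(self, v, w):
        # Recursive radix-2 FFT; w must be a root of unity of order len(v).
        L = len(v)
        if L == 1:
            return list(v)
        w2 = w * w % self.n
        even = self.fft(v[0::2], w2)
        odd = self.fft(v[1::2], w2)
        hi, lo = self.butterfly_pass(even, odd, w)
        return hi + lo

ctx = ModCtx(257)                         # example prime modulus
X = ctx.fft([1, 2, 3, 4, 5, 6, 7, 8], 64)  # 64 has order 8 mod 257
```

The design question is whether the per-pass virtual dispatch can be made cheap enough (in C: a function pointer per pass, not per butterfly) that the generic driver loses nothing against hand-specialized code.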
Anyway, even something pretty naive would work well enough for arbitrary-precision rings.