4d nfft is distinctly slower than comparable 3d nfft (without and with OpenMP)

For instance, on a computer with an Intel i7-6700 CPU @ 3.40GHz using 4 threads, a 3d adjoint nfft with M=28^3, N=(28,28,28), n=(56,56,56), m=10 takes approx. 0.1 s and the corresponding ndft approx. 9.2 s. A 4d adjoint nfft with M=12^4, N=(12,12,12,12), n=(24,24,24,24), m=4 takes approx. 1.2 s and the corresponding ndft approx. 8.2 s. So the 4d nfft is roughly 12 times slower than the 3d one, although the ndft runtimes are similar.

Most of the runtime is spent in the first step of the adjoint nfft and in the third step of the nfft (matrix B). For the adjoint version, part of the problem is that the flag NFFT_OMP_BLOCKWISE is currently not implemented for d>=4 and is ignored.

However, even when compiled without OpenMP support, the single-threaded version is still distinctly slower for d=4 than for d=3. This may be due to better-optimized code for the cases d=1,2,3.

NFFT/nfft issue #64