in-place fftw for dimension >1

While out-of-place FFTW is considerably faster than in-place in 1D, the speed difference seems to be negligible in higher dimensions (d > 1). Some tests I performed on Intel Skylake confirm this, as do the results on AMD Ryzen. However, out-of-place uses more memory, since it needs separate input and output arrays.

While we changed the default to in-place FFTW in NFSFT and NFSOFT, the NFFT still uses out-of-place by default. I would suggest using in-place as the default for d > 1.
