Request: possibly replace FFTW with CUFFT

fbergama commented 6 years ago

I'm not sure if this would provide a real benefit in terms of processing speed, but the idea of replacing FFTW with the CUDA-compatible implementation CUFFT is appealing.

I've tried replacing #include <fftw3.h> with #include <cufftw.h>, but the compilation fails because CUFFT does not support long double precision. Is there a way to disable long doubles via the configure script?

tvolkmer commented 6 years ago

Please try ./configure --enable-single

fbergama commented 6 years ago

I did, but it still complains that cufftw has no fftwl_complex type.

tvolkmer commented 6 years ago

I'm sorry, we do not support CUFFT and we currently do not plan to support this in the near future.

As far as I can see, the FFTW-compatible interface of CUFFT does not implement all definitions and features of FFTW3.

Some last hints. Maybe it works after following these steps, maybe not:

In m4/ax_lib_fftw3.m4, around line 76, replace the line
AC_SEARCH_LIBS([fftw${PREC_SUFFIX}_execute], [fftw3${PREC_SUFFIX} fftw3${PREC_SUFFIX}-3], [ax_lib_fftw3=yes], [ax_lib_fftw3=no], [-lm])
by
AC_SEARCH_LIBS([fftw${PREC_SUFFIX}_execute], [cufftw], [ax_lib_fftw3=yes], [ax_lib_fftw3=no], [-lm])
Run ./bootstrap.sh

Edit include/nfft3.h:

Replace the include fftw3.h by cufftw.h
Remove all lines containing the string LONG_DOUBLE

Insert after #define NFFT_CONCAT(prefix, name) prefix ## name:

#define FFTW_CONCAT(prefix, name) prefix ## name
#define FFTW_MANGLE_DOUBLE(name) FFTW_CONCAT(fftw_, name)
#define FFTW_MANGLE_FLOAT(name) FFTW_CONCAT(fftwf_, name)
typedef enum fftw_r2r_kind_do_not_use_me {
    FFTW_R2HC=0, FFTW_HC2R=1, FFTW_DHT=2,
    FFTW_REDFT00=3, FFTW_REDFT01=4, FFTW_REDFT10=5, FFTW_REDFT11=6,
    FFTW_RODFT00=7, FFTW_RODFT01=8, FFTW_RODFT10=9, FFTW_RODFT11=10
} fftwf_r2r_kind, fftw_r2r_kind;

Edit include/infft.h: Replace the include fftw3.h by cufftw.h
Run ./configure with your desired flags, but without --enable-openmp
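
To see what the FFTW_MANGLE_* macros inserted into include/nfft3.h expand to, here is a minimal C sketch. The STR_/STR helpers, the two string constants, and print_mangled_names are hypothetical names introduced only for this illustration; they are not part of NFFT or cufftw.

```c
#include <stdio.h>

/* Token-pasting macros, copied from the nfft3.h edit above. */
#define FFTW_CONCAT(prefix, name) prefix ## name
#define FFTW_MANGLE_DOUBLE(name) FFTW_CONCAT(fftw_, name)
#define FFTW_MANGLE_FLOAT(name) FFTW_CONCAT(fftwf_, name)

/* Hypothetical helpers (not in NFFT): turn the expanded identifier into a string. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* The identifiers that the two mangling macros produce for the name "plan". */
static const char *const mangled_double = STR(FFTW_MANGLE_DOUBLE(plan));
static const char *const mangled_float = STR(FFTW_MANGLE_FLOAT(plan));

/* Print both expansions, one per line. */
void print_mangled_names(void)
{
    printf("%s\n%s\n", mangled_double, mangled_float);
}
```

Calling print_mangled_names() from a main function prints fftw_plan and fftwf_plan. There is deliberately no fftwl_ variant here, in line with removing the LONG_DOUBLE lines.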

fbergama commented 6 years ago

Thank you very much for your assistance. Indeed, I was heading in the right direction! Just before your last reply I tried simply linking the cufftw library instead of fftw3 (by modifying the configure scripts as you suggested and removing matlab/openmp/etc). It looks like that is working: if I run examples/nfft/nfft_times, I see the GPU utilization going up to over 80%. Now I need to check whether the numbers are also correct...

If you are interested I can keep you updated if I make any progress.

tvolkmer commented 6 years ago

This would be very interesting.

Please note that the NFFT consists of three major steps and the FFT is only one of these steps. Parallelizing only the FFT with CUDA will probably not give much speedup.
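
A rough way to quantify this is Amdahl's law. The C sketch below is only an illustration: the function name overall_speedup and the fraction used in the example are assumptions, not measurements of NFFT.

```c
/* Amdahl's law: overall speedup when only part of the runtime is accelerated.
 * fft_fraction: share of the total runtime spent in the FFT step (0..1),
 *               assumed to be known from profiling.
 * fft_speedup:  factor by which the FFT step alone becomes faster (> 0). */
double overall_speedup(double fft_fraction, double fft_speedup)
{
    return 1.0 / ((1.0 - fft_fraction) + fft_fraction / fft_speedup);
}
```

For example, if the FFT were one third of the runtime and became 10 times faster, overall_speedup(1.0 / 3.0, 10.0) would give only about 1.43.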

Please also have a look at the CUDA-based implementation "Nonequispaced FFTs on GPUs", available from the homepage of Stefan Kunis.

fbergama commented 6 years ago

Oh god, the GPU implementation of Stefan Kunis was exactly what I was looking for! Thank you so much!

Anyway, I've checked the results of the NFFT library compiled with CUFFT. The numbers are all correct (same results as the CPU version), but the compute time is even higher due to the overhead introduced by the host-device copy. The FFT itself is faster (as expected), but the overall performance of the transform is worse.
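
The effect of the host-device copy can be illustrated with a simple cost model. This is a sketch under stated assumptions: the function names, the factor of two for one copy in each direction, and all parameters are hypothetical and are not taken from the measurements.

```c
/* Estimated runtime when only the FFT step runs on the GPU (hypothetical model).
 * other_s:     time of the non-FFT steps on the CPU, in seconds
 * fft_cpu_s:   time of the FFT step on the CPU, in seconds
 * fft_speedup: assumed speedup of the FFT itself on the GPU
 * bytes:       size of the FFT data, copied once to and once from the device
 * bytes_per_s: assumed host-device transfer bandwidth */
double offload_total_seconds(double other_s, double fft_cpu_s, double fft_speedup,
                             double bytes, double bytes_per_s)
{
    double copy_s = 2.0 * bytes / bytes_per_s; /* to the device and back */
    return other_s + fft_cpu_s / fft_speedup + copy_s;
}

/* Returns 1 if offloading beats the CPU-only runtime, 0 otherwise. */
int offload_pays_off(double other_s, double fft_cpu_s, double fft_speedup,
                     double bytes, double bytes_per_s)
{
    return offload_total_seconds(other_s, fft_cpu_s, fft_speedup, bytes, bytes_per_s)
           < other_s + fft_cpu_s;
}
```

In this model, a faster FFT still loses whenever the copy time exceeds the time saved in the FFT step, in which case offload_pays_off returns 0.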

For reference, I've collected the results of the nfft_times demo:

https://www.overleaf.com/read/vcfsmdqcpknw

I think that you can close the issue if you want. I'll definitely look at the Kunis implementation. Thanks again

tvolkmer commented 6 years ago

Thank you very much for your feedback and the results. I'm glad we could help.

tvolkmer commented 6 years ago

There may be another implementation available at https://github.com/gadgetron/gadgetron/tree/master/toolboxes/nfft/gpu
