FFTW / fftw3

DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)
GNU General Public License v2.0
2.75k stars 666 forks source link

r2r fft of 2D dataset in just one drection #221

Closed chi86 closed 3 years ago

chi86 commented 3 years ago

Hi, I would like to perform a fft in one direction for a 2D dataset. Now I use fftw_f = fftw_plan_r2r_2d(imax,kmax, *in, *out, FFTW_R2HC,FFTW_R2HC, FFTW_MEASURE); fftw_b = fftw_plan_r2r_2d(imax,kmax, *out, *in, FFTW_HC2R,FFTW_HC2R, FFTW_MEASURE); for my forward and backwards transformation, where in & out are 2D arrays!

Yet this performs a FFTW_R2HC & FFTW_HC2R for both directions. It would be convenient to set the second r2r kind, so that there is no transformation in this direction. According to: 4.3.6 Real-to-Real Transform Kinds there is currently no kind with this functionality. Is there another way to implement something like this?

stevengj commented 3 years ago

Yes, an FFTW_IDENTITY kind could be convenient here!

In the meantime, you can already achieve what you want by using the advanced or guru interfaces.

chi86 commented 3 years ago

Thank you for the hint! I implemented it now with "fftw_plan_many_r2r", yet I noticed that the case with a transformation in both directions is faster.

When I implement the plans it as following: fftw_f = fftw_plan_r2r_2d(imax,kmax, *in, *out, FFTW_R2HC,FFTW_R2HC, FFTW_MEASURE); fftw_b = fftw_plan_r2r_2d(imax,kmax, *out, *in, FFTW_HC2R,FFTW_HC2R, FFTW_MEASURE); it takes "real 1m43.079"

Whereas if I use the "plan_many", reading kind[0] = FFTW_R2HC; fftw_f = fftw_plan_many_r2r(1,&kmax,imax, *in,&imax, 1,kmax, *out,&imax, 1,kmax, kind, FFTW_MEASURE); kind[0] = FFTW_HC2R; fftw_b = fftw_plan_many_r2r(1,&kmax,imax, *out,&imax, 1,kmax, *in,&imax, 1,kmax, kind, FFTW_MEASURE); it takes "real 2m10.332s".

I prescribed the same imax&kmax for both cases.

I would have expected to see a runtime average by just computing the transformation in one direction, however it took longer.

Is there a faster way?

The final goal of this tests is to benchmark fftw with VFFTPK.

chi86 commented 3 years ago

I just briefly looked through the code and saw, if I am not mistaken, that the "fftw_plan_r2r_2d" plan is implemented as a special form of the "plan_many_r2r". So consequently by proper implementation (which I apparently did not achieve) the implementation with "fftw_plan_many_r2r", essentially just a transformation in one direction has to be faster.

stevengj commented 3 years ago

I think you want &kmax and not &imax for the inembed and onembed parameters.

stevengj commented 3 years ago

Are you timing only the execution of the FFTs (fftw_execute), or are you also timing the plan creation? Only the execution time is meaningful to compare.

chi86 commented 3 years ago

Thanks I fixed that! The result is the same and either, with imax and kmax gives me the analytical solution (I solve a simple poisson equation)

Indeed the timing includes both, yet in order to minimize the impact of the plan creation I create it only once and than I run the execution in a loop 1E6 times.

stevengj commented 3 years ago

Can you time them separately? Just add calls to gettimeofday or similar in your code.

chi86 commented 3 years ago

Ok I did that with the following outcome: fftw_plan_many_r2r Dt plan create 1038 mus Dt plan execute 1381166 mus real 0m1.383s user 0m1.381s sys 0m0.001s Plan r2r_2d

fftw_plan_r2r_2d Dt plan create 3593 mus Dt plan execute 468257 mus real 0m0.472s user 0m0.472s sys 0m0.000s

stevengj commented 3 years ago

What size transform are you benchmarking?

chi86 commented 3 years ago

One data set, which will be transformed will be at least 1024 element long, and around 5E5 of these is stored in an Array. For my purpose I will re-do this for up to 1E6 times.

chi86 commented 3 years ago

I made an example of the code and put it in a repo, to clarify what I want to do. Thank you very much for your help!

stevengj commented 3 years ago

I tried your situation with the includedtests/bench benchmarking program, doing 65536 transforms of size 1024 (R2HC) (specified as k1024f*65536), vs a 65536x1024 2d R2HC transform (specfied as k65536fx1024f), running single-threaded in double precision on my 2.7GHz Intel Core i7 laptop:

Problem: k65536fx1024f, setup: 43.50 s, time: 5.72 s, ``mflops'': 762.62086
Problem: k1024f*65536, setup: 70.00 us, time: 274.65 ms, ``mflops'': 6108.6471

As expected, performing the transforms along just the rows was much faster (20x faster: time is the time for 1 transform, collected over many trials).

Can you try running the FFTW bench program?

chi86 commented 3 years ago

Here is the output for the tests/bench benchmarking on a Desktop i5 4.1GHz:

Problem: k65536fx1024f, setup: 23.94 s, time: 3.34 s, ``mflops'': 1306

Problem: k1024f*65536, setup: 36.03 ms, time: 199.40 ms, ``mflops'': 8413.8

As you expected, also for me the transform along the rows is much faster (for me a factor of 16.75).

Yet this mean, I messed the implementation up. Thanks for your support!

chi86 commented 3 years ago

I fixed now some issues with the allocation of the array and I am now at a point where the fftw_plan_many_r2r is twice as fast as the fftw_plan_r2r_2d.

I looked briefly through the bench.c code and noticed something, which I can't understand. For the r2r api_simple the two dimestions are sz->dims[0].n, sz->dims[1].n, whereas for the api_many the n from static int *mkn(bench_tensor *t) is basically t->dims[0].n and the "howmany" is vecsz->dims[0].n, shouldn't it be vecsz->dims[1].n for the "howmany", otherwise it is just done 1024 times instead of 65536, or am I wrong?

stevengj commented 3 years ago

There is no vecsz->dims[1].n — the vecsz->dims array is a 1-element array if you have a single howmany loop, independent of the length of the t->dims array (whose length is the dimensionality of the transform).

In particular, t and vecsz correspond to dims and howmany_dims, respectively, in the guru interface.

chi86 commented 3 years ago

Ah thank you for the clarification!