r2r fft of 2D dataset in just one drection

chi86 commented 3 years ago

Hi, I would like to perform a fft in one direction for a 2D dataset. Now I use fftw_f = fftw_plan_r2r_2d(imax,kmax, *in, *out, FFTW_R2HC,FFTW_R2HC, FFTW_MEASURE); fftw_b = fftw_plan_r2r_2d(imax,kmax, *out, *in, FFTW_HC2R,FFTW_HC2R, FFTW_MEASURE); for my forward and backwards transformation, where in & out are 2D arrays!

Yet this performs a FFTW_R2HC & FFTW_HC2R for both directions. It would be convenient to set the second r2r kind, so that there is no transformation in this direction. According to: 4.3.6 Real-to-Real Transform Kinds there is currently no kind with this functionality. Is there another way to implement something like this?

stevengj commented 3 years ago

Yes, an FFTW_IDENTITY kind could be convenient here!

In the meantime, you can already achieve what you want by using the advanced or guru interfaces.

chi86 commented 3 years ago

Thank you for the hint! I implemented it now with "fftw_plan_many_r2r", yet I noticed that the case with a transformation in both directions is faster.

When I implement the plans it as following: fftw_f = fftw_plan_r2r_2d(imax,kmax, *in, *out, FFTW_R2HC,FFTW_R2HC, FFTW_MEASURE); fftw_b = fftw_plan_r2r_2d(imax,kmax, *out, *in, FFTW_HC2R,FFTW_HC2R, FFTW_MEASURE); it takes "real 1m43.079"

Whereas if I use the "plan_many", reading kind[0] = FFTW_R2HC; fftw_f = fftw_plan_many_r2r(1,&kmax,imax, *in,&imax, 1,kmax, *out,&imax, 1,kmax, kind, FFTW_MEASURE); kind[0] = FFTW_HC2R; fftw_b = fftw_plan_many_r2r(1,&kmax,imax, *out,&imax, 1,kmax, *in,&imax, 1,kmax, kind, FFTW_MEASURE); it takes "real 2m10.332s".

I prescribed the same imax&kmax for both cases.

I would have expected to see a runtime average by just computing the transformation in one direction, however it took longer.

Is there a faster way?

The final goal of this tests is to benchmark fftw with VFFTPK.

chi86 commented 3 years ago

I just briefly looked through the code and saw, if I am not mistaken, that the "fftw_plan_r2r_2d" plan is implemented as a special form of the "plan_many_r2r". So consequently by proper implementation (which I apparently did not achieve) the implementation with "fftw_plan_many_r2r", essentially just a transformation in one direction has to be faster.

stevengj commented 3 years ago

I think you want &kmax and not &imax for the inembed and onembed parameters.

stevengj commented 3 years ago

Are you timing only the execution of the FFTs (fftw_execute), or are you also timing the plan creation? Only the execution time is meaningful to compare.

chi86 commented 3 years ago

Thanks I fixed that! The result is the same and either, with imax and kmax gives me the analytical solution (I solve a simple poisson equation)

Indeed the timing includes both, yet in order to minimize the impact of the plan creation I create it only once and than I run the execution in a loop 1E6 times.

stevengj commented 3 years ago

Can you time them separately? Just add calls to gettimeofday or similar in your code.

chi86 commented 3 years ago

Ok I did that with the following outcome: fftw_plan_many_r2r Dt plan create 1038 mus Dt plan execute 1381166 mus real 0m1.383s user 0m1.381s sys 0m0.001s Plan r2r_2d

fftw_plan_r2r_2d Dt plan create 3593 mus Dt plan execute 468257 mus real 0m0.472s user 0m0.472s sys 0m0.000s

stevengj commented 3 years ago

What size transform are you benchmarking?

chi86 commented 3 years ago

One data set, which will be transformed will be at least 1024 element long, and around 5E5 of these is stored in an Array. For my purpose I will re-do this for up to 1E6 times.

chi86 commented 3 years ago

I made an example of the code and put it in a repo, to clarify what I want to do. Thank you very much for your help!

stevengj commented 3 years ago

I tried your situation with the includedtests/bench benchmarking program, doing 65536 transforms of size 1024 (R2HC) (specified as k1024f*65536), vs a 65536x1024 2d R2HC transform (specfied as k65536fx1024f), running single-threaded in double precision on my 2.7GHz Intel Core i7 laptop:

Problem: k65536fx1024f, setup: 43.50 s, time: 5.72 s, ``mflops'': 762.62086
Problem: k1024f*65536, setup: 70.00 us, time: 274.65 ms, ``mflops'': 6108.6471

As expected, performing the transforms along just the rows was much faster (20x faster: time is the time for 1 transform, collected over many trials).

Can you try running the FFTW bench program?

chi86 commented 3 years ago

Here is the output for the tests/bench benchmarking on a Desktop i5 4.1GHz:

Problem: k65536fx1024f, setup: 23.94 s, time: 3.34 s, ``mflops'': 1306

Problem: k1024f*65536, setup: 36.03 ms, time: 199.40 ms, ``mflops'': 8413.8

As you expected, also for me the transform along the rows is much faster (for me a factor of 16.75).

Yet this mean, I messed the implementation up. Thanks for your support!

chi86 commented 3 years ago

I fixed now some issues with the allocation of the array and I am now at a point where the fftw_plan_many_r2r is twice as fast as the fftw_plan_r2r_2d.

I looked briefly through the bench.c code and noticed something, which I can't understand. For the r2r api_simple the two dimestions are sz->dims[0].n, sz->dims[1].n, whereas for the api_many the n from static int *mkn(bench_tensor *t) is basically t->dims[0].n and the "howmany" is vecsz->dims[0].n, shouldn't it be vecsz->dims[1].n for the "howmany", otherwise it is just done 1024 times instead of 65536, or am I wrong?

stevengj commented 3 years ago

There is no vecsz->dims[1].n — the vecsz->dims array is a 1-element array if you have a single howmany loop, independent of the length of the t->dims array (whose length is the dimensionality of the transform).

In particular, t and vecsz correspond to dims and howmany_dims, respectively, in the guru interface.

chi86 commented 3 years ago

Ah thank you for the clarification!

FFTW / fftw3

r2r fft of 2D dataset in just one drection #221