JuliaMath / FFTW.jl

Julia bindings to the FFTW library for fast Fourier transforms
https://juliamath.github.io/FFTW.jl/stable
MIT License

FFTW.r2r slow #62

Open skeletonyk opened 6 years ago

skeletonyk commented 6 years ago

I was trying to use the discrete sine transform, which is not in the standard Base library (why is the cosine transform there~?), so I turned to FFTW.r2r with the FFTW.RODFT00 flag. It turned out to be 2x slower than Matlab's dst somehow. Does anyone have the same issue?

This is the result for the 2D transform.

Julia: @time ainv(rand(1000,1000))
0.612343 seconds (204 allocations: 30.528 MiB, 0.86% gc time)

Matlab: Elapsed time is 0.140380 seconds.

PS I turned the multithreading on in Julia already.

Thanks!

stevengj commented 6 years ago

Possibly Matlab (which I think also calls FFTW for DST/DCT?) is caching the plan? What is your ainv function?

Note that FFTW.RODFT00 computes a type-I DST. It will be much more efficient for 999x999 than for 1000x1000, as explained in the FFTW manual. In general, for a type-I DST, you want n+1 to be highly composite.
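The size sensitivity can be checked directly (a sketch, assuming FFTW.jl's r2r API; the timings are illustrative, not from this thread):

```julia
using FFTW

# A DST-I (RODFT00) of size n maps internally to an FFT of size 2(n+1),
# so n = 999 (n+1 = 1000 = 2^3 * 5^3) should plan and execute much
# faster than n = 1000 (n+1 = 1001 = 7 * 11 * 13).
a = rand(999, 999)
b = rand(1000, 1000)
p_fast = FFTW.plan_r2r(a, FFTW.RODFT00)
p_slow = FFTW.plan_r2r(b, FFTW.RODFT00)
@time p_fast * a   # expect this one to be noticeably faster
@time p_slow * b
```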

stevengj commented 6 years ago

(Note that the dct function computes a type-II DCT. This is probably the most common type of sine/cosine transform — it shows up a lot in signal processing. DSTs do not show up as frequently in signal processing because their boundary conditions are "wrong" for energy compaction. For solving PDEs, you need several types of DCTs and DSTs to handle different boundary conditions, so a single dst or dct function would not be sufficient.)
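The type distinction matters for inverses, too. A small round-trip sketch (the 2n scaling is the FFTW manual's documented convention for the unnormalized DCT-II/DCT-III pair):

```julia
using FFTW

x = rand(8)
# REDFT10 is the (unnormalized) DCT-II; REDFT01 is the DCT-III.
# Per the FFTW manual, applying one after the other multiplies the
# input by 2n, so dividing by 2n recovers the original vector.
y  = FFTW.r2r(x, FFTW.REDFT10)
x2 = FFTW.r2r(y, FFTW.REDFT01) ./ (2 * length(x))
@assert x2 ≈ x
```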

skeletonyk commented 6 years ago

Thanks for the answer. I turned to use r2r(FFTW.RODFT01) with r2r(FFTW.RODFT10) as inverse (up to a factor). I am still seeing similar speed results.
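(The "up to a factor" above can be verified with a round trip; a sketch, where the 2n scaling follows the FFTW manual's convention for the unnormalized DST-II/DST-III pair:)

```julia
using FFTW

x = rand(16)
# RODFT01 (DST-III) followed by RODFT10 (DST-II) scales the input
# by 2n, so the round trip recovers x after dividing by 2n.
y  = FFTW.r2r(x, FFTW.RODFT01)
x2 = FFTW.r2r(y, FFTW.RODFT10) ./ (2 * length(x))
@assert x2 ≈ x
```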

Then I did a small test, and Matlab is showing some interesting behavior~:

Matlab: x = randn(2^7*3^6,3000); tic; y = fft(x); toc

Results (I ran it several times):

Elapsed time is 16.860562 seconds.
Elapsed time is 12.541518 seconds.
Elapsed time is 7.180677 seconds.
Elapsed time is 5.641902 seconds.
Elapsed time is 2.557857 seconds.
Elapsed time is 1.935052 seconds.
Elapsed time is 1.740693 seconds.

Julia: a = rand(2^7*3^6,3000); tic(); fft(a,1); toc();

Results: I also ran it many times, and it stays around this value:

elapsed time: 6.946659572 seconds

Is it because Matlab is calculating a 'plan' for the repeated fft? Why is it speeding up that much?

Ke

skeletonyk commented 6 years ago

BTW, I also tried plan_fft in Julia. The result is:

Julia: a = rand(2^7*3^6,3000); P = plan_fft(a,1); a = rand(2^7*3^6,3000); tic(); P*a; toc();

result: elapsed time: 7.219601897 seconds

Somehow it is even slower~


In terms of r2r, the following is a case where I found r2r or dct to be much slower:

Julia: @time dct(rand(181,181),1);
0.047096 seconds (92 allocations: 520.203 KiB)

@time fft(rand(181,181),1);
0.005183 seconds (81 allocations: 1.255 MiB)

stevengj commented 6 years ago

I did @time P*a; twice, and the second time was faster than fft(a,1), so you may be just measuring compilation time. In general, you should think about using the BenchmarkTools package to help you do proper benchmarking. (And you don't need to benchmark doing so many transforms: BenchmarkTools will collect statistics for you.)

Note that fft (and the plan version too) will always copy a real input array to a complex array first before performing the transform. It's possible Matlab is using the real-input FFT functions instead, for a factor of 2 speedup.
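If the input is real, the real-input transform avoids that copy. A sketch using rfft, which returns only the non-redundant half of the spectrum:

```julia
using FFTW

a = rand(1000, 1000)
# rfft along dimension 1 returns a (div(n,2)+1) x 1000 complex array,
# exploiting the conjugate symmetry of real-input FFTs.
p = plan_rfft(a, 1)
y = p * a
@assert size(y) == (501, 1000)
```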

181 is a prime number, and the r2r algorithms for prime sizes are even more complicated than the complex-fft algorithms for prime sizes, so you lose a constant factor there. Still, the difference is not so great if you benchmark properly (using plans and BenchmarkTools):

julia> a = rand(181,181);

julia> p = plan_fft(a, 1);

julia> pr = FFTW.plan_r2r(a, FFTW.REDFT10, 1); # DCT-II

julia> @btime fft($a, 1);
  1.566 ms (62 allocations: 1.00 MiB)

julia> @btime $p * $a;
  1.457 ms (4 allocations: 1.00 MiB)

julia> @btime $pr * $a;
  1.520 ms (2 allocations: 256.08 KiB)

(It's a bit faster with the FFTW.MEASURE flag.)
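For reference, passing the flag looks like this (a sketch; planning with FFTW.MEASURE takes longer up front, because it times candidate algorithms instead of using the FFTW.ESTIMATE heuristics, but can produce a faster plan):

```julia
using FFTW

a = rand(181, 181)
# Plan a DCT-II along dimension 1, letting FFTW measure candidate
# algorithms at planning time rather than estimating.
p = FFTW.plan_r2r(a, FFTW.REDFT10, 1; flags=FFTW.MEASURE)
y = p * a
@assert size(y) == (181, 181)
```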

But even for a non-prime size like 256, the FFT and DCT execution times are similar for a small transform like this. The basic reason for this is that FFTs in FFTW are more optimized than its DCTs, and in particular complex FFTs make better use of SIMD, especially for small sizes like this. However, the DCTs require much less memory for transforming real arrays than complex-data FFTs. And, in any case, they are different transforms.

skeletonyk commented 6 years ago

Thanks for the detailed reply and the tests. I didn't know I could use @btime before. Thanks.

I did run it a few times to avoid the compilation time. I tried the same code as yours, and the results are:

722.920 μs (53 allocations: 1.00 MiB)
223.660 μs (4 allocations: 1.00 MiB)
15.340 ms (2 allocations: 256.08 KiB)

(I switched to another desktop)

Something is not right with my r2r, haha. Can I ask which version of Julia you are using? Also, did you build FFTW to use the MKL library?

stevengj commented 6 years ago

I'm using Julia 0.6.2 with the built-in FFTW (not linked to MKL). (MKL doesn't have the r2r transforms IIRC.)

MikaelSlevinsky commented 5 years ago

@stevengj, when you say

FFTs in FFTW are more optimized than its DCTs, and in particular complex FFTs make better use of SIMD, especially for small sizes like this

do you mean that the real-to-real microkernels use SIMD via SSE2/AVX2?

When I read the output of fftw_print_plan, it seems to me that the real-to-real transforms don't use SIMD, because their codelet names don't say that they do, but I'm not sure.

julia> x = rand(128);

julia> plan_fft(x)
FFTW forward plan for 128-element array of Complex{Float64}
(dft-ct-dit/8
  (dftw-direct-8/6 "t3fv_8_avx2_128")
  (dft-direct-16-x8 "n2fv_16_avx2_128"))

julia> FFTW.plan_r2r(x, FFTW.REDFT01)
FFTW r2r REDFT01 plan for 128-element array of Float64
(redft01e-r2hc-128
  (rdft-r2hc-direct-r2c-128 "r2cf_128"))

julia> FFTW.plan_r2r(x, FFTW.HC2R)
FFTW r2r HC2R plan for 128-element array of Float64
(rdft-hc2r-direct-r2c-128 "r2cb_128")

julia> FFTW.plan_r2r(x, FFTW.R2HC)
FFTW r2r R2HC plan for 128-element array of Float64
(rdft-ct-dit/16
  (hc2hc-direct-16/8 "hf2_16"
    (rdft-r2hc-direct-r2c-16 "r2cf_16")
    (rdft-r2hc01-direct-r2c-16 "r2cfII_16"))
  (rdft-r2hc-direct-r2c-8-x16 "r2cf_8"))

julia>