fft benchmark problems - Githubissues

stevengj commented 9 years ago

You wrote: However, for such a small transform, you're seeing a lot of overhead in fft(a) from setting up the FFT plan, setting up the trig tables, etcetera. To really get the full benefit of FFTW, you need to create a precomputed plan:

julia> p = plan_fft(a);

julia> @time b3 = p*a;
  0.002879 seconds (1.39 k allocations: 97.771 KB)

julia> @time b3 = p*a;
  0.000026 seconds (10 allocations: 32.375 KB)

On my machine, my_fft(a) is about 6 times lower than fft(a), but about 33 times slower than p*a.

stevengj commented 9 years ago

I should also mention that your FFT implementation pays a big price for:

allocating tons of temporary arrays via slicing x[1:2:end] etc
recursing all the way down to N==1 (which means that you see a lot of function-call overhead).
computing the trig factors on the fly by calls to exp rather than precomputing a table (or using a recurrence, although that can sacrifice some accuracy). Note that you are calling exp twice when you only need to call it once, and only need to multiply it by z_odd[k] once.

stevengj commented 9 years ago

Here is a simple "upgrade" that is about 4x faster than my_fft(a) by working in-place on a pre-allocated output array y:

function myfft!(n, y, y0, ys, x, x0, xs)
    if n == 1
        y[y0] = x[x0]
    else
        n2 = n >> 1
        myfft!(n2, y, y0, ys, x, x0, 2*xs)
        y1 = y0 + n2*ys
        myfft!(n2, y, y1, ys, x, x0 + xs, 2*xs)
        θ = -2π/n
        for k = 0:n2-1
            a, b = y[y0+k*ys], y[y1+k*ys] * cis(θ*k)
            y[y0+k*ys] = a + b
            y[y1+k*ys] = a - b
        end
    end
    return y
end
myfft(x) = myfft!(length(x), Array(promote_type(Complex128, eltype(x)), length(x)), 1,1, x, 1,1)

daanhb commented 9 years ago

Very, very helpful, thanks!

Funny coincidence - for the tutorial, I originally had in mind adding a chapter devoted to "in-place methods", and I started writing it based on improving this very same my_fft routine. But the whole thing was getting too long, and I left it out. Your in-place suggestion is faster than the one I had - unsurprisingly. I also tried an in-place variant using views, rather than using explicit offsets and strides, and that code is more readable but so far I could not manage to accomplish zero memory allocation in this way.

As for the tutorial, since the my_fft function is potentially one of the first examples of Julia code the reader ever sees, I don't want to intimidate anyone with bells and whistles. I wanted it to look like a naive Matlab implementation. But I certainly don't want to leave the impression that this is the best you can do in Julia, so I'll adapt the example to make this clear and time permitting I'll add that chapter - the message being that you can heavily optimize functions within Julia, without having to resort to a different language. If I get around to doing that, I hope I can reach out for some more comments, that would be helpful.

stevengj commented 9 years ago

I understand—it's certainly sensible to do the simplest possible implementation for pedagogical purposes. In any case, you should time the plan_fft(a) version when comparing to the built-in routines, however, in addition to fft(a) and my_fft(a).

daanhb commented 8 years ago

For now I've added timings using plan_fft, I had no idea the difference could be so substantial. Although the difference is a lot less on my desktop and laptop than it is on Juliabox: on Juliabox I get timings similar to the ones quoted above, at home I don't. I must be using suboptimal libraries. In any case, with a larger performance gap, the example looses some of its appeal and it would become more compelling again if I add more optimizations in Julia, so I still have that in mind.

daanhb / Julia-tutorial

fft benchmark problems #2