JuliaMath / FFTW.jl

Julia bindings to the FFTW library for fast Fourier transforms
https://juliamath.github.io/FFTW.jl/stable
MIT License

slowdown in threaded code from julia 1.2 to julia 1.4-DEV #121

Closed gasagna closed 4 years ago

gasagna commented 5 years ago

I have a package where different threads perform different FFTs, and I observe a significant slowdown on the latest julia-1.4-DEV.

Here is a MWE:

using BenchmarkTools
using LinearAlgebra
using Base.Threads
using FFTW

function setup(N, M=4)
    A = [randn(N, N) for i = 1:M]
    Â = [rfft(A[i], [2, 1]) for i = 1:M]
    plan = plan_rfft(A[1], [2, 1])
    return A, Â, plan
end

function test(A, Â, plan)
    @threads for i = 1:length(A)
        LinearAlgebra.mul!(Â[i], plan, A[i])
    end
end

for N = [32, 64, 128, 256]
    args = setup(N, 4)
    @btime test($args...)
end

On my 2016 MacBook Pro I get these times

davide@vega - Desktop$: export JULIA_NUM_THREADS=1; julia mew.jl 
  19.083 μs (11 allocations: 816 bytes)
  63.617 μs (11 allocations: 816 bytes)
  371.126 μs (11 allocations: 816 bytes)
  2.355 ms (11 allocations: 816 bytes)
davide@vega - Desktop$: export JULIA_NUM_THREADS=1; julia-master mew.jl 
  19.733 μs (9 allocations: 784 bytes)
  64.619 μs (9 allocations: 784 bytes)
  372.343 μs (9 allocations: 784 bytes)
  2.351 ms (9 allocations: 784 bytes)
davide@vega - Desktop$: export JULIA_NUM_THREADS=2; julia mew.jl 
  17.831 μs (18 allocations: 1.48 KiB)
  41.078 μs (17 allocations: 1.47 KiB)
  195.599 μs (17 allocations: 1.47 KiB)
  1.197 ms (17 allocations: 1.47 KiB)
davide@vega - Desktop$: export JULIA_NUM_THREADS=2; julia-master mew.jl 
  115.478 μs (1069 allocations: 84.78 KiB)
  160.183 μs (1108 allocations: 87.83 KiB)
  264.680 μs (631 allocations: 51.31 KiB)
  1.289 ms (630 allocations: 51.30 KiB)
davide@vega - Desktop$: export JULIA_NUM_THREADS=4; julia mew.jl 
  19.946 μs (29 allocations: 2.27 KiB)
  31.612 μs (29 allocations: 2.28 KiB)
  113.469 μs (29 allocations: 2.31 KiB)
  654.074 μs (30 allocations: 2.31 KiB)
davide@vega - Desktop$: export JULIA_NUM_THREADS=4; julia-master mew.jl 
  85.846 μs (1651 allocations: 133.36 KiB)
  67.317 μs (1043 allocations: 86.73 KiB)
  235.660 μs (2230 allocations: 178.41 KiB)
  798.563 μs (2478 allocations: 197.66 KiB)

i.e. there is significant overhead with multiple threads at small array sizes.

I am not sure whether this belongs in FFTW.jl or Julia Base, but replacing the LinearAlgebra.mul!(Â[i], plan, A[i]) line with anything else (e.g. A[i] .+= A[i]) does not incur the same penalty, so I am submitting the issue here.

baggepinnen commented 4 years ago

Possibly related https://github.com/JuliaDSP/DSP.jl/issues/339

stevengj commented 4 years ago

In Julia 1.4, FFTW now uses partr threads (#105), which means that plan_rfft creates a plan that itself (potentially) uses threads. When you wrap the FFTW execution in a @threads for loop, partr decides how to schedule the loop tasks and the FFTW tasks among the physical CPU threads.

Unfortunately, spawning a partr task has a fairly large overhead (less than a physical hardware thread, but much more than e.g. a Cilk thread, and far more than a subroutine call), so this leads to a slowdown for a threaded loop of small transforms.

cc @vtjnash

baggepinnen commented 4 years ago

Could PARTR explain why there is a sudden jump in both execution time and memory allocations when Julia is started with 2 versus 4 threads?

# 2 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.84 MiB
  allocs estimate:  8255
  --------------
  minimum time:     17.288 ms (0.00% GC)
  median time:      17.457 ms (0.00% GC)
  mean time:        17.661 ms (0.65% GC)
  maximum time:     21.014 ms (0.00% GC)
  --------------
  samples:          283
  evals/sample:     1

# 4 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  118.42 MiB
  allocs estimate:  1308803
  --------------
  minimum time:     91.844 ms (0.00% GC)
  median time:      125.869 ms (25.65% GC)
  mean time:        128.205 ms (19.04% GC)
  maximum time:     251.423 ms (14.35% GC)
  --------------
  samples:          39
  evals/sample:     1

julia> FFTW.fftw_vendor
:fftw

MWE:

using Pkg; pkg"add DSP#master"
using DSP, BenchmarkTools
img = randn(1000,1000);
kernel = randn(35,35);
typeof(img)
typeof(kernel)
@benchmark DSP.conv($img, $kernel)

Julia version info

julia> versioninfo(verbose=true)
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      Ubuntu 18.04.1 LTS
  uname: Linux 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz: 
              speed         user         nice          sys         idle          irq
       #1  2442 MHz     413888 s      38233 s      84143 s    3291817 s          0 s
       #2  2329 MHz     402000 s      31602 s      84474 s    1964856 s          0 s
       #3  2335 MHz     309327 s      11461 s     106453 s    2008096 s          0 s
       #4  2477 MHz     356923 s      36650 s      85812 s    1956186 s          0 s

  Memory: 15.564483642578125 GB (502.1171875 MB free)
  Uptime: 113146.0 sec
  Load Avg:  0.5146484375  0.63671875  0.728515625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
  JULIA_EDITOR = atom
  JULIA_NUM_THREADS = 4
  MANDATORY_PATH = /usr/share/gconf/ubuntu.mandatory.path
  DEFAULTS_PATH = /usr/share/gconf/ubuntu.default.path
  HOME = /home/fredrikb
  WINDOWPATH = 2
  TERM = xterm-256color
  PATH = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

stevengj commented 4 years ago

Note that if you are launching your own threads and want FFTW to execute its own plans serially, you can just do FFTW.set_num_threads(1) before creating your plans.
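A minimal sketch of that pattern applied to the MWE above (the array sizes here are illustrative):

```julia
using FFTW, LinearAlgebra, Base.Threads

FFTW.set_num_threads(1)          # plans created after this execute serially

A = [randn(64, 64) for _ in 1:4]
plan = plan_rfft(A[1], (2, 1))   # serial plan
Â = [plan * a for a in A]

# Parallelism comes only from the loop; each transform uses one thread.
@threads for i in 1:length(A)
    mul!(Â[i], plan, A[i])
end
```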

baggepinnen commented 4 years ago

Thanks, setting the number of threads manually worked out wonderfully :)

ViralBShah commented 4 years ago

cc @JeffBezanson @keno

galenlynch commented 4 years ago

I think I just ran into a case where this issue causes convolution code in DSP.jl to run 43x slower and use 85x more memory, although I'm not yet positive that the root cause is FFTW. Since this seems to be such an issue for DSP's convolution, which does not use multithreading itself, I would like to better understand what the current recommendation from FFTW.jl is, and why this is such a problem. The affected function in DSP.jl, conv, performs a large number of small FFTs serially, and its performance drops off a cliff when JULIA_NUM_THREADS=4 is set before starting Julia. I'm hesitant to make a change with global effect, like calling FFTW.set_num_threads(1), inside conv. Perhaps there is another workaround for this situation?

galenlynch commented 4 years ago

An MWE that captures what's happening in DSP's conv:

using LinearAlgebra, FFTW, BenchmarkTools
function foo!(input)
    s = size(input)
    fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
    p = plan_rfft(input)
    ip = plan_brfft(fbuff, s[1])
    for i in 1:53000
        mul!(fbuff, p, input)
        mul!(input, ip, fbuff)
    end
    return input
end
A = rand(8, 8);
@benchmark foo!(A)

With four threads I get:

  memory estimate:  1.05 GiB
  allocs estimate:  10489157
  --------------
  minimum time:     1.099 s (5.75% GC)
  median time:      1.101 s (5.95% GC)
  mean time:        1.105 s (5.85% GC)
  maximum time:     1.122 s (5.63% GC)
  --------------
  samples:          5
  evals/sample:     1

with one thread I get:

  memory estimate:  8.59 KiB
  allocs estimate:  124
  --------------
  minimum time:     10.278 ms (0.00% GC)
  median time:      10.889 ms (0.00% GC)
  mean time:        11.350 ms (0.00% GC)
  maximum time:     25.835 ms (0.00% GC)
  --------------
  samples:          440
  evals/sample:     1

giordano commented 4 years ago

Julia v1.4.1:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads

julia> nthreads()
1

julia> function foo!(input)
           s = size(input)
           fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
           p = plan_rfft(input)
           ip = plan_brfft(fbuff, s[1])
           for i in 1:53000
               mul!(fbuff, p, input)
               mul!(input, ip, fbuff)
           end
           return input
       end
foo! (generic function with 1 method)

julia> A = rand(8, 8);

julia> FFTW.set_num_threads(1)

julia> @btime foo!($A);
  12.559 ms (124 allocations: 8.59 KiB)

julia> FFTW.set_num_threads(4)

julia> @btime foo!($A);
  2.904 s (124 allocations: 8.59 KiB)

Julia v1.0.5:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads

julia> nthreads()
1

julia> function foo!(input)
           s = size(input)
           fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
           p = plan_rfft(input)
           ip = plan_brfft(fbuff, s[1])
           for i in 1:53000
               mul!(fbuff, p, input)
               mul!(input, ip, fbuff)
           end
           return input
       end
foo! (generic function with 1 method)

julia> A = rand(8, 8);

julia> FFTW.set_num_threads(1)

julia> @btime foo!($A);
  12.586 ms (126 allocations: 8.75 KiB)

julia> FFTW.set_num_threads(4)

julia> @btime foo!($A);
  2.882 s (126 allocations: 8.75 KiB)

galenlynch commented 4 years ago

Oh sorry, my benchmarks were all on Julia 1.4.1, with JULIA_NUM_THREADS=4 and JULIA_NUM_THREADS=1.

galenlynch commented 4 years ago

Is there any way to query FFTW's current num_threads setting, so that a function can reversibly alter it?

giordano commented 4 years ago

#117

galenlynch commented 4 years ago

Was FFTW's num_threads always set to the number of Julia threads, or has that changed recently?

giordano commented 4 years ago

It's 4 times the number of Julia threads when there is more than one: https://github.com/JuliaMath/FFTW.jl/blob/d5a74b99004caeda66587229c6cf1aaf7b40ff8a/src/FFTW.jl#L60. This was introduced in #105.
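Paraphrasing that linked line, the default amounts to roughly this heuristic (a sketch, not the exact source):

```julia
# Default FFTW thread count chosen by FFTW.jl at init (see #105):
# 4 tasks per Julia thread when Julia is multithreaded, otherwise 1.
default_fftw_threads(njulia::Integer) = njulia > 1 ? 4 * njulia : 1

default_fftw_threads(1)  # → 1
default_fftw_threads(4)  # → 16
```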

galenlynch commented 4 years ago

It seems like one workaround is making plans with the FFTW.PATIENT flag, which according to the FFTW3 docs will change the number of threads depending on the problem size. For the example I gave above, this rescues the performance without having to call set_num_threads.
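For reference, a sketch of this workaround using the planner's flags keyword (sizes illustrative; PATIENT planning can itself take a while):

```julia
using FFTW

A = rand(8, 8)
# FFTW.PATIENT makes the planner measure candidate algorithms (including
# different thread counts) at plan time, so small transforms may end up serial.
p = plan_rfft(A; flags=FFTW.PATIENT)
Â = p * A
```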

#117 seems like it would be difficult to implement, since I can't find an accessor-like counterpart to fftw_plan_with_nthreads in FFTW3's library index. Perhaps arguments to FFTW.jl's set_num_threads could be cached by the module, allowing an accessor function that returns the last number of threads passed to FFTW3.
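The caching idea might look something like this sketch (set_num_threads_tracked, with_fftw_threads, and the _fftw_nthreads ref are hypothetical names, not part of FFTW.jl):

```julia
using FFTW

# Hypothetical: remember the last value passed to FFTW.set_num_threads,
# since FFTW3 exposes no accessor for the current setting.
const _fftw_nthreads = Ref(1)

function set_num_threads_tracked(n::Integer)
    FFTW.set_num_threads(n)
    _fftw_nthreads[] = n
end

# Reversibly run f with a temporary FFTW thread count.
function with_fftw_threads(f, n::Integer)
    old = _fftw_nthreads[]
    set_num_threads_tracked(n)
    try
        return f()
    finally
        set_num_threads_tracked(old)
    end
end
```

Note this only tracks calls routed through the wrapper; a direct call to FFTW.set_num_threads would go unnoticed.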

This seems to me like a very large performance regression when using FFTW.jl "out of the box." I understand that enabling threaded plans by default, without regard to problem size, inherently trades off performance between small and large problems. On the one hand, a 240x slowdown on small problems might not be that noticeable when the runtime was so short to begin with, while a ~4x (or however many cores a user has) speedup on large problems might translate into seconds saved. However, if people are making plans, they are probably executing them many times, and a two-orders-of-magnitude slowdown on small problems can really add up.

Is set_num_threads going to be part of the supported API of this package? I don't see it in the docs.

Some mention of this performance regression in the docs might be helpful. I'll try to throw a PR together sometime if that would be helpful.

galenlynch commented 4 years ago

Using the FFTW.PATIENT flag turns out not to be a great workaround, for the perhaps obvious reason that it can take a long time (e.g. JuliaDSP/DSP.jl#362).