Possibly related https://github.com/JuliaDSP/DSP.jl/issues/339
In Julia 1.4, FFTW now uses partr threads (#105), which means that plan_rfft creates a plan that itself (potentially) uses threads. When you wrap the FFTW execution in @threads for, partr decides how to schedule the for-loop threads and the FFTW threads among the physical CPU threads.
Unfortunately, spawning a partr thread has a fairly large overhead (less than spawning a physical hardware thread, but much more than e.g. a cilk thread, and far more than a subroutine call), so this leads to a slowdown for a threaded loop of small transforms.
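For illustration, a minimal sketch of the pattern described above (the array sizes and loop count are made up): many small transforms executed inside a threaded loop, where each FFTW execution may itself schedule partr tasks.

using FFTW, Base.Threads

xs = [rand(8, 8) for _ in 1:1000]   # many small inputs
p = plan_rfft(xs[1])                # a plan that may itself use threads
@threads for i in eachindex(xs)
    p * xs[i]                       # each execution can spawn partr tasks
end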
cc @vtjnash
Could PARTR explain why there is a sudden jump in both execution time and memory allocations when Julia is started with 2 versus 4 threads?
# 2 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.84 MiB
allocs estimate: 8255
--------------
minimum time: 17.288 ms (0.00% GC)
median time: 17.457 ms (0.00% GC)
mean time: 17.661 ms (0.65% GC)
maximum time: 21.014 ms (0.00% GC)
--------------
samples: 283
evals/sample: 1
# 4 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 118.42 MiB
allocs estimate: 1308803
--------------
minimum time: 91.844 ms (0.00% GC)
median time: 125.869 ms (25.65% GC)
mean time: 128.205 ms (19.04% GC)
maximum time: 251.423 ms (14.35% GC)
--------------
samples: 39
evals/sample: 1
julia> FFTW.fftw_vendor
:fftw
MWE:
using Pkg; pkg"add DSP#master"
using DSP, BenchmarkTools
img = randn(1000,1000);
kernel = randn(35,35);
typeof(img)
typeof(kernel)
@benchmark DSP.conv($img, $kernel)
Julia version info
julia> versioninfo(verbose=true)
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
Ubuntu 18.04.1 LTS
uname: Linux 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64
CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz:
speed user nice sys idle irq
#1 2442 MHz 413888 s 38233 s 84143 s 3291817 s 0 s
#2 2329 MHz 402000 s 31602 s 84474 s 1964856 s 0 s
#3 2335 MHz 309327 s 11461 s 106453 s 2008096 s 0 s
#4 2477 MHz 356923 s 36650 s 85812 s 1956186 s 0 s
Memory: 15.564483642578125 GB (502.1171875 MB free)
Uptime: 113146.0 sec
Load Avg: 0.5146484375 0.63671875 0.728515625
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
JULIA_EDITOR = atom
JULIA_NUM_THREADS = 4
MANDATORY_PATH = /usr/share/gconf/ubuntu.mandatory.path
DEFAULTS_PATH = /usr/share/gconf/ubuntu.default.path
HOME = /home/fredrikb
WINDOWPATH = 2
TERM = xterm-256color
PATH = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Note that if you are launching your own threads and want FFTW to execute its own plans serially, you can just do FFTW.set_num_threads(1) before creating your plans.
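For example, a minimal sketch of that workaround (the transform size here is arbitrary):

using FFTW

FFTW.set_num_threads(1)    # plans created after this call execute serially
p = plan_rfft(rand(8, 8))  # this plan will not spawn FFTW threads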
Thanks, setting the number of threads manually worked out wonderfully :)
cc @JeffBezanson @keno
I think I just ran into a case where this issue causes convolution code in DSP.jl to run 43x slower and use 85x more memory, although I'm not yet positive that the root cause is FFTW. Since this seems to be such an issue with DSP convolution, which does not use multithreading itself, I would like to better understand what the current recommendation from FFTW.jl is, or why this is such a problem. The affected function in DSP.jl, conv, performs a large number of small FFT transforms serially, and performance drops off a cliff when setting JULIA_NUM_THREADS=4 prior to starting Julia. I'm a bit hesitant to make a change with global effect, like setting FFTW.set_num_threads(1), inside of conv. Perhaps there is another workaround in this situation?
A more minimal MWE that captures what's happening in DSP's conv:
using LinearAlgebra, FFTW, BenchmarkTools
function foo!(input)
    s = size(input)
    # rfft output buffer: the transformed first dimension has length div(s[1], 2) + 1
    fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
    p = plan_rfft(input)           # forward real-to-complex plan
    ip = plan_brfft(fbuff, s[1])   # backward (unnormalized) complex-to-real plan
    # a large number of small transforms, executed serially
    for i in 1:53000
        mul!(fbuff, p, input)
        mul!(input, ip, fbuff)
    end
    return input
end
A = rand(8, 8);
@benchmark foo!(A)
With four threads I get:
memory estimate: 1.05 GiB
allocs estimate: 10489157
--------------
minimum time: 1.099 s (5.75% GC)
median time: 1.101 s (5.95% GC)
mean time: 1.105 s (5.85% GC)
maximum time: 1.122 s (5.63% GC)
--------------
samples: 5
evals/sample: 1
With one thread I get:
memory estimate: 8.59 KiB
allocs estimate: 124
--------------
minimum time: 10.278 ms (0.00% GC)
median time: 10.889 ms (0.00% GC)
mean time: 11.350 ms (0.00% GC)
maximum time: 25.835 ms (0.00% GC)
--------------
samples: 440
evals/sample: 1
Julia v1.4.1:
julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads
julia> nthreads()
1
julia> function foo!(input)
           s = size(input)
           fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
           p = plan_rfft(input)
           ip = plan_brfft(fbuff, s[1])
           for i in 1:53000
               mul!(fbuff, p, input)
               mul!(input, ip, fbuff)
           end
           return input
       end
foo! (generic function with 1 method)
julia> A = rand(8, 8);
julia> FFTW.set_num_threads(1)
julia> @btime foo!($A);
12.559 ms (124 allocations: 8.59 KiB)
julia> FFTW.set_num_threads(4)
julia> @btime foo!($A);
2.904 s (124 allocations: 8.59 KiB)
Julia v1.0.5:
julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads
julia> nthreads()
1
julia> function foo!(input)
           s = size(input)
           fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
           p = plan_rfft(input)
           ip = plan_brfft(fbuff, s[1])
           for i in 1:53000
               mul!(fbuff, p, input)
               mul!(input, ip, fbuff)
           end
           return input
       end
foo! (generic function with 1 method)
julia> A = rand(8, 8);
julia> FFTW.set_num_threads(1)
julia> @btime foo!($A);
12.586 ms (126 allocations: 8.75 KiB)
julia> FFTW.set_num_threads(4)
julia> @btime foo!($A);
2.882 s (126 allocations: 8.75 KiB)
Oh sorry, my benchmarks were all on Julia 1.4.1, with JULIA_NUM_THREADS=4 and JULIA_NUM_THREADS=1.
Is there any way to get the current number of FFTW threads (num_threads), so a function can reversibly alter it?
Was FFTW's num_threads always set to the number of Julia threads, or has that changed recently?
It's 4 times the number of Julia threads when there is more than one: https://github.com/JuliaMath/FFTW.jl/blob/d5a74b99004caeda66587229c6cf1aaf7b40ff8a/src/FFTW.jl#L60. This was introduced in #105.
It seems like one workaround is making plans with the FFTW.PATIENT flag, which according to the FFTW3 docs will change the number of threads depending on the problem size. For the example I gave above, this rescues the performance without having to call set_num_threads.
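For reference, a minimal sketch of that workaround (the problem size is arbitrary; PATIENT planning measures candidate strategies, so plan creation itself is slower):

using FFTW

A = rand(8, 8)
p = plan_rfft(A; flags=FFTW.PATIENT)  # planner may choose fewer threads for small sizes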
See fftw_plan_with_nthreads in FFTW3's library index. Perhaps arguments to FFTW.jl's set_num_threads could be cached by the module, allowing the creation of an accessor function to get the last number of threads passed to FFTW3.
This seems to me like a very large performance regression when using FFTW.jl "out of the box." I understand that enabling threaded plans by default, without regard to the problem size, inherently involves a trade-off in performance between small and large problems. On the one hand, a 240x slowdown of small problems might not be that noticeable when the execution runtime was so short to begin with, while a ~4x (or however many cores a user has) speedup for large problems might translate into seconds saved. However, if people are making plans, they are probably using them many times, and a two-orders-of-magnitude slowdown for small problems can really add up.
Is set_num_threads going to be part of the supported API of this package? I don't see it in the docs.
Some mention of this performance regression in the docs might be helpful. I'll try to throw a PR together sometime if that would be helpful.
Using the FFTW.PATIENT flag turns out not to be a great workaround, for the perhaps obvious reason that planning can take a long time (e.g. JuliaDSP/DSP.jl#362).
I have a package where I have different threads performing different FFTs, and I observe a significant slowdown in the latest julia-1.4-DEV. Here is an MWE:
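(The original code block was not preserved here; what follows is a hypothetical reconstruction based on the description in this message, where each thread applies a precomputed plan to its own small array.)

using FFTW, LinearAlgebra, Base.Threads

n = 8                                             # small array dimension
A = [rand(n, n) for _ in 1:nthreads()]            # one input per thread
Â = [zeros(Complex{Float64}, n ÷ 2 + 1, n) for _ in 1:nthreads()]
plan = plan_rfft(A[1])

@threads for i in eachindex(A)
    LinearAlgebra.mul!(Â[i], plan, A[i])          # the line reported to incur the penalty
end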
On my 2016 MacBook Pro I get these times, i.e. there is significant overhead with multiple threads at small array dimensions.
I am not sure whether this belongs in FFTW.jl or Julia Base, but replacing the LinearAlgebra.mul!(Â[i], plan, A[i]) line with anything else (e.g. A[i] .+= A[i]) does not incur the same penalty, so I am submitting the issue here.