tl;dr: if you use many threads, running `FFTW.set_num_threads(1)` can be a good idea. Otherwise FFTW will likely slow down computation and prevent outer parallelism. I suggest adding this to the README.
Full explanation
I was running a lot of KDE computations in a loop, and it turned out that running the code in parallel slowed the process down. This happens even if I simply set `JULIA_NUM_THREADS=20` (on a 56-core server) without using `@threads`:
```julia
using KernelDensity
using Base.Threads

interp_kde(coords::Array{Float64, 2}, bandwidth::Float64) =
    InterpKDE(kde((coords[1, :], coords[2, :]), bandwidth=(bandwidth, bandwidth)))

td = rand(2, 100000);

@time for i in 1:500
    interp_kde(td, 1.0)
end
```
It spawns multiple threads at about 30% load each and takes 15.9 seconds. The same code with `JULIA_NUM_THREADS=1` takes 7.5 seconds, running entirely on a single thread. The timing doesn't really change if I use `@threads`:
```julia
@time @threads for i in 1:500
    interp_kde(td, 1.0)
end
```
After some digging, the problem turned out to be in the FFTW package, which is called somewhere during the interpolation and by default uses `nthreads() * 4` threads inside its C code. To disable this, you need to run `FFTW.set_num_threads(1)`. After that, running with `JULIA_NUM_THREADS=20` but without `@threads` takes 7.5 seconds, as it should, and with `@threads` it takes 0.5 seconds.
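Put together, the working pattern looks roughly like this (a minimal sketch using FFTW directly rather than through KernelDensity; `plan_rfft` here just stands in for whatever transform the interpolation performs internally):

```julia
using FFTW
using Base.Threads

# Keep each FFTW call single-threaded so it doesn't oversubscribe the
# cores that the outer @threads loop is already using.
FFTW.set_num_threads(1)

x = rand(100_000)
p = plan_rfft(x)  # plans created after set_num_threads(1) run single-threaded

results = Vector{Vector{ComplexF64}}(undef, 500)
@threads for i in 1:500
    # Executing an existing plan from multiple threads is safe in FFTW;
    # only the planning step itself must be serialized.
    results[i] = p * x
end
```

With this setup the 500 transforms are distributed across the Julia threads instead of each transform fighting for all cores at once.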
I tried various run configurations, and in the end it looks like FFTW's internal parallelism only improves on single-threaded FFTW for large arrays (>500000 elements) and large numbers of iterations (>100). And it is always much worse than parallelizing the outer loop.
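That comparison can be reproduced roughly as follows (a sketch, not a rigorous benchmark; timings are machine-dependent and the size/iteration values are just the thresholds observed above):

```julia
using FFTW
using Base.Threads

n, iters = 500_000, 100  # roughly the thresholds observed above
x = rand(n)

# Configuration A: FFTW-internal parallelism, serial outer loop.
FFTW.set_num_threads(nthreads())
p_multi = plan_rfft(x)
t_fftw = @elapsed for _ in 1:iters
    p_multi * x
end

# Configuration B: single-threaded FFTW, parallel outer loop.
FFTW.set_num_threads(1)
p_single = plan_rfft(x)
t_outer = @elapsed @threads for _ in 1:iters
    p_single * x
end

println("FFTW-internal: $(t_fftw)s, outer @threads: $(t_outer)s")
```

On the server above, configuration B wins decisively; with small arrays or few iterations, configuration A can even lose to fully serial execution because of FFTW's thread start-up overhead.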