I just stumbled across your multithreaded parallel_reduce implementations here. Because you're writing to shared cache lines from different threads (tmp[threadid()]) in a hot loop (1:N, where N may be large), these implementations will very likely suffer (a lot) from false sharing.
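For reference, the pattern I mean looks roughly like this (a hypothetical sketch, not the exact code from the repo, and the `parallel_reduce_naive` name is mine): every loop iteration writes to `tmp[Threads.threadid()]`, so different threads repeatedly write to adjacent `Float64` slots that live on the same cache line.

```julia
# Hypothetical sketch of the problematic pattern (not the repo's exact code):
# each iteration writes to tmp[threadid()], so threads hammer neighboring
# array slots that share a cache line -> false sharing.
function parallel_reduce_naive(N::I, f::F, x...) where {I<:Integer,F<:Function}
    tmp = zeros(Threads.nthreads())
    Threads.@threads :static for i in 1:N
        # Per-iteration write to a shared array slot (the hot spot).
        tmp[Threads.threadid()] += f(i, x...)
    end
    return [sum(tmp)]
end
```

The result is still correct (with `:static` scheduling, each iteration's `threadid()` is stable), it's just needlessly slow under contention.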
I recommend performing a thread-local reduction first, i.e. something like the following (untested):
using ChunkSplitters

function parallel_reduce(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    tmp = zeros(nt)  # one accumulator slot per chunk, written only once each
    Threads.@threads :static for (idcs, t) in chunks(1:N, nt)
        # Thread-local reduction: accumulate the whole chunk into a local
        # sum, then write the result to tmp[t] a single time.
        tmp[t] = sum(i -> f(i, x...), idcs)
    end
    return [sum(tmp)]
end
(One could avoid the ChunkSplitters dependency by using e.g. Iterators.partition.)
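A dependency-free variant along those lines could look like this (again untested; the `parallel_reduce_partition` name and the equal-chunk sizing are my own choices):

```julia
# Same thread-local reduction, but chunking 1:N with Iterators.partition
# instead of ChunkSplitters. Sketch only; signature assumed as above.
function parallel_reduce_partition(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    # Split 1:N into at most nt contiguous chunks of roughly equal size.
    parts = collect(Iterators.partition(1:N, cld(N, nt)))
    tmp = zeros(length(parts))  # one slot per chunk, written once each
    Threads.@threads :static for t in eachindex(parts)
        # Each task reduces its own chunk privately, then stores once.
        tmp[t] = sum(i -> f(i, x...), parts[t])
    end
    return [sum(tmp)]
end
```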
Thanks @carstenbauer for pointing this out. You're the expert, so please chime in anytime. We'll restart some of this work when we come back from the holidays.