I just stumbled across your multithreaded parallel_reduce implementations here. Because you're writing to shared cache lines from different threads (tmp[threadid()]) in a hot loop (1:N, where N may be large), these implementations will very likely suffer (a lot) from false sharing.
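For reference, the pattern I mean looks roughly like this (a hypothetical sketch, not the exact code from the repo, and the `parallel_reduce_naive` name is mine): every loop iteration writes to `tmp[Threads.threadid()]`, so different threads repeatedly write to adjacent `Float64` slots that live on the same cache line.

```julia
# Hypothetical sketch of the problematic pattern (not the repo's exact code):
# each iteration writes to tmp[threadid()], so threads hammer neighboring
# array slots that share a cache line -> false sharing.
function parallel_reduce_naive(N::I, f::F, x...) where {I<:Integer,F<:Function}
    tmp = zeros(Threads.nthreads())
    Threads.@threads :static for i in 1:N
        # Per-iteration write to a shared array slot (the hot spot).
        tmp[Threads.threadid()] += f(i, x...)
    end
    return [sum(tmp)]
end
```

The result is still correct (with `:static` scheduling, each iteration's `threadid()` is stable), it's just needlessly slow under contention.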
I recommend performing a thread-local reduction first, i.e. something like the following (untested):
using ChunkSplitters

function parallel_reduce(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    tmp = zeros(nt)  # one accumulator slot per chunk, written only once each
    Threads.@threads :static for (idcs, t) in chunks(1:N, nt)
        # Thread-local reduction: accumulate the whole chunk into a local
        # sum, then write the result to tmp[t] a single time.
        tmp[t] = sum(i -> f(i, x...), idcs)
    end
    return [sum(tmp)]
end
(One could avoid the ChunkSplitters dependency by using e.g. Iterators.partition.)
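A dependency-free variant along those lines could look like this (again untested; the `parallel_reduce_partition` name and the equal-chunk sizing are my own choices):

```julia
# Same thread-local reduction, but chunking 1:N with Iterators.partition
# instead of ChunkSplitters. Sketch only; signature assumed as above.
function parallel_reduce_partition(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    # Split 1:N into at most nt contiguous chunks of roughly equal size.
    parts = collect(Iterators.partition(1:N, cld(N, nt)))
    tmp = zeros(length(parts))  # one slot per chunk, written once each
    Threads.@threads :static for t in eachindex(parts)
        # Each task reduces its own chunk privately, then stores once.
        tmp[t] = sum(i -> f(i, x...), parts[t])
    end
    return [sum(tmp)]
end
```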
Thanks @carstenbauer for pointing this out. You're the expert, so please chime in anytime. We'll restart some of this work when we come back from the holidays.