JuliaORNL / JACC.jl

CPU/GPU parallel performance portable layer in Julia via functions as arguments
MIT License

False sharing in multithreaded `parallel_reduce` #23

Open carstenbauer opened 10 months ago

carstenbauer commented 10 months ago

I just stumbled across your multithreaded `parallel_reduce` implementations here. Because different threads write to the same cache lines (`tmp[threadid()]`) in a hot loop (`1:N`, where `N` may be large), these implementations will very likely suffer badly from false sharing.
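For illustration, a minimal sketch of the problematic pattern (hypothetical function name; not the exact JACC.jl code): every iteration writes to `tmp[threadid()]`, so slots owned by different threads sit on shared cache lines that ping-pong between cores.

```julia
# Sketch of the false-sharing pattern described above (not the actual
# JACC.jl source). The result is correct under :static scheduling, since
# each thread only touches its own slot, but performance suffers because
# adjacent Float64 slots share cache lines.
function parallel_reduce_false_sharing(N, f, x...)
    tmp = zeros(Threads.nthreads())  # ~8 adjacent Float64s per cache line
    Threads.@threads :static for i in 1:N
        # Hot-loop write to a shared cache line on every iteration:
        tmp[Threads.threadid()] += f(i, x...)
    end
    return [sum(tmp)]
end
```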

I recommend performing a thread-local reduction first, i.e. something like the following (untested):

```julia
using ChunkSplitters

function parallel_reduce(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    tmp = zeros(nt)
    # Each thread reduces its own contiguous chunk locally and writes to
    # tmp only once per chunk, instead of once per iteration.
    Threads.@threads :static for (idcs, t) in chunks(1:N, nt)
        tmp[t] = sum(i -> f(i, x...), idcs)
    end
    return [sum(tmp)]
end
```

(One could avoid the ChunkSplitters dependency by using e.g. Iterators.partition.)

williamfgc commented 9 months ago

Thanks @carstenbauer for pointing this out. You're the expert here, so please chime in anytime. We'll restart some of this work when we come back from the holidays.