Closed carstenbauer closed 1 year ago
I fail to see how the workload would be more evenly distributed for
nchunks > nthreads()
. Already fornchunks == nthreads()
does every thread have the same workload.
First, clearly you demonstrate that the example there is flawed relative to what I wanted to show. So it needs to be fixed, of course. Thank you for pointing that out.
What I was thinking when building that example is that the workload of each chunk may be unevenly distributed by the nature of the workload (not by anything related to the threading model).
I tried to obtain some example that shows what I wanted with @threads
, but it ended up having to be way too artificial. You are correct, the reasoning I was doing is valid only for @spawn/@sync
, for which a greater a number of spawned tasks can use the fact that the threads can occupy idle threads.
I will be fixing the documentation asap. Thanks.
Let me see if I get this right now.
I have these functions which receive a very artificially unbalanced workload, defined by an input workload
vector for each task:
julia> function uneven_workload_threads(x, workload; nchunks=nthreads(), chunk_type=:batch)
s = fill(zero(eltype(x)), nchunks)
@threads for (xrange, ichunk) in chunks(n_workload, nchunks, chunk_type)
for i in xrange
s[ichunk] += sum(log(x[j])^7 for j in 1:workload[i])
end
end
return sum(s)
end
uneven_workload_threads (generic function with 1 method)
julia> function uneven_workload_spawn(x, workload; nchunks=nthreads(), chunk_type=:batch)
s = fill(zero(eltype(x)), nchunks)
@sync for (xrange, ichunk) in chunks(n_workload, nchunks, chunk_type)
@spawn for i in xrange
s[ichunk] += sum(log(x[j])^7 for j in 1:workload[i])
end
end
return sum(s)
end
uneven_workload_spawn (generic function with 1 method)
We create a very unbalanced workload, with:
julia> x = rand(10^6); work = collect(i <= 8 ? 10^4 : 1 for i in 1:64);
And we get, then, using nchunks = nthreads()
:
julia> @btime uneven_workload_threads($x, $n_workload; nchunks=8, chunk_type=:batch)
3.624 ms (47 allocations: 4.70 KiB)
-3.096516397309112e8
julia> @btime uneven_workload_threads($x, $n_workload; nchunks=8, chunk_type=:scatter)
729.888 μs (47 allocations: 4.70 KiB)
-3.096516397309112e8
julia> @btime uneven_workload_spawn($x, $n_workload; nchunks=8, chunk_type=:batch)
3.597 ms (93 allocations: 6.27 KiB)
-3.096516397309112e8
julia> @btime uneven_workload_spawn($x, $n_workload; nchunks=8, chunk_type=:scatter)
736.788 μs (96 allocations: 6.61 KiB)
-3.096516397309112e8
Such that it is possible to deal with the unbalaced workload with the :scatter
option, since there is, here, a correlation between chunk index and workload. However, if that is not known, one can deal with the workload by increasing the number of chunks, if using the @sync/@spawn
option:
julia> @btime uneven_workload_spawn($x, $n_workload; nchunks=64, chunk_type=:batch)
845.804 μs (659 allocations: 47.98 KiB)
-3.096516397309112e8
But the same does not work if using @threads
:
julia> @btime uneven_workload_threads($x, $n_workload; nchunks=64, chunk_type=:batch)
3.619 ms (46 allocations: 5.11 KiB)
-3.096516397309112e8
Do you agree with this? I think I will replace the example docs with these examples.
I'll take a closer look soon, but I think that a balanced workload (that you have in the original example that is currently in the docs) is also worth mentioning.
Here is what the docs are looking like. I kept the other example before, just removed everything associated with the varying number of chunks there. And added this:
Here we define two functions which (artificially) result in very uneven workload distributions
among tasks. Basically, we sum log(x[i])^7
for x[i]
being the elements of an array. However,
each task has to sum a different number of elements, defined in a workload
vector. The
workload vector will have 64
tasks, where the first 8
tasks will perform the sum for 10^4
elements of the x
array, and the other tasks will sum only 5 elements.
The functions are defined using Threads.@threads
or Threads.@sync/Threads.@spawn
macros of
base julia, which imply different possibilities of load balancing.
The functions are:
julia> using Base.Threads, ChunkSplitters
julia> function uneven_workload_threads(x, work_load; nchunks::Int, chunk_type::Symbol)
s = fill(zero(eltype(x)), nchunks)
@threads for (xrange, ichunk) in chunks(work_load, nchunks, chunk_type)
for i in xrange
s[ichunk] += sum(log(x[j])^7 for j in 1:work_load[i])
end
end
return sum(s)
end
julia> function uneven_workload_spawn(x, work_load; nchunks::Int, chunk_type::Symbol)
s = fill(zero(eltype(x)), nchunks)
@sync for (xrange, ichunk) in chunks(work_load, nchunks, chunk_type)
@spawn for i in xrange
s[ichunk] += sum(log(x[j])^7 for j in 1:work_load[i])
end
end
return sum(s)
end
We create a very unbalanced workload, with:
julia> x = rand(10^6); work_load = collect(i <= 8 ? 10^4 : 1 for i in 1:64);
julia> using UnicodePlots
julia> lineplot(work_load; xlabel="task", ylabel="workload")
┌────────────────────────────────────────┐
10 000 │⠈⠉⠉⠉⢹⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
workload │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠈⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
0 │⠀⠀⠀⠀⠀⣇⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⠀⠀⠀│
└────────────────────────────────────────┘
⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀70⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀task⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
Using nchunks == nthreads()
, which are 8
in this case, we get the following timings:
julia> @btime uneven_workload_threads($x, $work_load; nchunks=8, chunk_type=:batch)
3.671 ms (43 allocations: 4.52 KiB)
-1.121686418772953e9
julia> @btime uneven_workload_threads($x, $work_load; nchunks=8, chunk_type=:scatter)
728.946 μs (47 allocations: 4.64 KiB)
-1.121686418772953e9
julia> @btime uneven_workload_spawn($x, $work_load; nchunks=8, chunk_type=:batch)
3.604 ms (59 allocations: 5.08 KiB)
-1.121686418772953e9
julia> @btime uneven_workload_spawn($x, $work_load; nchunks=8, chunk_type=:scatter)
737.583 μs (62 allocations: 5.17 KiB)
-1.121686418772953e9
Therefore, it is possible to deal with the unbalaced workload with the :scatter
option, since there is, here, a correlation between chunk index and workload. However, if that is not known, one can deal with the workload by increasing the number of chunks, if using the @sync/@spawn
option:
julia> @btime uneven_workload_spawn($x, $work_load; nchunks=64, chunk_type=:batch)
820.537 μs (397 allocations: 38.80 KiB)
-1.1216864187729511e9
But the same does not work if using @threads
, because the first 8
tasks will noneless be assigned to the same thread:
julia> @btime uneven_workload_threads($x, $work_load; nchunks=64, chunk_type=:batch)
3.689 ms (43 allocations: 4.95 KiB)
-1.1216864187729511e9
Frankly, I must say that you've made the discussion somewhat more complicated by now mixing
@threads
vs @sync @spawn
:batch
vs :scatter
I'll try to work myself through all of this but this will take more time now.
(Also note that I would have chosen a continuously decreasing/increasing workload instead of a pretty special/biased step function that has its step precisely at the number of threads.)
General comment up-front: I assume that false sharing is less of an issue in the following because threads are writing to s
at different times (due to the uneven workload). So essentially we are now talking about effects not mentioned in the OP.
I chose a linearly increasing workload, which makes understanding the results pretty straightforward.
work_load = floor.(Int, range(start=1, stop=10^4, length=64))
┌────────────────────────────────────────┐
10 000 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠉⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
workload │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⢀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⡠⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
0 │⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
└────────────────────────────────────────┘
⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀70⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀task⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
Findings
@threads
:
:batch
nchunks=8
and nchunks=64
.:scatter
:batch
, performance is much better for nchunks=nthreads()
.:batch
, the first few threads get all the cheap work wheras the latter threads get all the expensive work (everyone gets the same number of work items). With :scatter
we distribute the workload more evenly: each thread gets a piece of the workload along the linear increase.nchunks==threads()
is better than (the limit) nchunks == nworkitems >> nthreads()
(where nworkitems = length(workload)
)nchunks == nworkitems
, each chunk consists of only a single work item. Therefore the "load balancing feature" of :scatter
(for this workload) isn't working anymore. (Try nchunks==nworkitems/2
, i.e. 32 for the explicit case here and you'll find that the speedup compared to :batch
is about (t_batch64 - t_scatter8) / (t_batch64 / t_scatter8)
. In my test this was the case within ~2%.)@spawn
:
:batch
@spawn
+ :batch
+ nchunks==nthreads()
≈ @threads
+:batch
+ nchunks==nthreads()
@spawn
load-balancing, which operates on tasks, is thus effectively disabled, which is why @spawn
essentially becomes @thread
.@spawn
+ :batch
+ nchunks == nworkitems >> nthreads()
<≈ @threads
+:scatter
+ nchunks==nthreads()
@spawn
load-balancing. In the latter, we do the load-balancing "manually" via :scatter
(see above). However, for arbitrary/unknown workloads, @spawn
will generally perform better (think of a pathological workload that has all the workload on every other or every nthreads()
s workitem).:scatter
@spawn
+ :scatter
+ nchunks == nworkitems >>nthreads()
<≈ @spawn
+ :scatter
+ nchunks == nthreads()
@spawn
is effectively disabled and thus we see a performance identical to @threads
+ :scatter
+ nchunks == nthreads()
. However, this performance is good (see above) because we implement "manual load balancing" through :scatter
. In the former case, :scatter
becomes irrelevant (because chunks only contain a single work item) and the load-balancing is done by @spawn
. Thus, we see comparable performance, slightly in favor of @spawn
.TLDR:
In combination with @threads
, :scatter
can be a "poor man's load-balancing". But @spawn
gives slightly better and more general load-balancing (if one gives it enough chunks to balance...). Increasing the number of chunks only helps with @spawn
and only really with @spawn
+ :batch
. Again, one needs to give it enough chunks to balance... It does not help with @threads
because that doesn't load-balance anyway.
I have updated following your previous comments, and released the new docs:
https://m3g.github.io/ChunkSplitters.jl/v1.0/
Now I'll take a look at these new observations. If you want to contribute something directly there, you're welcome, of course.
edit: I agree with all your observations above, and the docs now I think more or less reflect that. I just don't mention the tuning of nchunks @threads + :scatter
you suggest above, maybe that's an unnecessary complication.
Frankly, I'm running out of steam and, while I don't want to sound harsh, I don't feel that my comments are valued properly. It took me quite some effort to write down the study above and you already update the docs without waiting for my (announced) comments. Also, you immediately and completely deviated from the point that I made in the OP (which was false-sharing). (No offense, just trying to communicate my impression here.)
In any case, I think the paragraphs in the new docs aren't particularly clear and can still be improved. I would also add a sentence along the following lines to the (original) even-workload example with @threads
:
"Note that increasing the number of chunks beyond
nthreads()
gives better performance for the above example. However, this is due to more subtle effects (false-sharing) and not related to the chunking and the distribution of work among threads. For well-designed parallel algorithms,nchunks == nthreads()
should be optimal in conjuction with@threads
."
But apart from this, I will step back from this now (and maybe come back on a different day).
Well, I'm sorry you got that impression. Since your comment I have been doing experiments to address the issue, and I wrote that new version of the docs before you released that detailed account of the observations, because I wanted to remove the incorrect benchmark that was there as soon as possible.
At the same time I was feeding my 2 month baby and taking care of my 4 years old kid while my dog is dying. So, please, don't take it personally.
Concerning your points:
1) False sharing. I completely bought your argument, and that's why I searched for a different example where that could be less of an issue. My goal has been to provide an example where :scatter
and :batch
, and possibly the use of nchunks > nthreads()
could be useful for better load balancing. I didn't got the impression that I should keep the example where false sharing can be an issue. So I removed the benchmarks with nchunks > nthreads()
in that case.
2) Your note about false sharing, suggested here, is nice, as I understand as a final note for the discussion on load balancing. I'll add it now.
You brought two important issues here: the example where I was trying to demonstrate the use of nchunks > nthreads()
was wrong conceptually; and that using that when the parallelization is done is @threads
is of no use. The observations concerning the workload distribution with :scatter
and :batch
are what expect (that's why those options exist). It is not clear to me if their utility is not clear to start with and if you are suggesting that a detailed explanation of the effects, as you did, would be useful for the docs.
Anyway, I'm open to suggestions, and if you anytime feel like contributing, of course you'll be mostly welcome.
At the same time I was feeding my 2 month baby and taking care of my 4 years old kid while my dog is dying. So, please, don't take it personally.
No worries, I'm not taking it personally. Congratulations for your newborn and condolences for your dog. I wish you all the best!
- False sharing. I completely bought your argument, and that's why I searched for a different example where that could be less of an issue. My goal has been to provide an example where
:scatter
and:batch
, and possibly the use ofnchunks > nthreads()
could be useful for better load balancing. I didn't got the impression that I should keep the example where false sharing can be an issue. So I removed the benchmarks withnchunks > nthreads()
in that case.
I don't think there needs to be a false sharing example (although it is a rather common performance mistake), but I think there should be an example with even workload because it is 1) a common case and 2) rather easy to reason about.
Under 2.: It is not clear to me if their utility is not clear to start with and if you are suggesting that a detailed explanation of the effects, as you did, would be useful for the docs.
The detailed exploration is probably too much for the docs (or, if presented, it should be "optional", e.g. behind a (un)foldable element or similar). But I think that something along the lines of my "TLDR" above would benefit the docs. Currently, the overall message about what :scatter
and nchunks > nthreads()
is doing is a little bit "hidden" by the multiple paragraphs and multiple examples/benchmarks.
On a more general note, you are conflating ideas and terminology a little bit, which makes it harder for the reader to understand what's going on. For example, right at the start you show this image while talking about :batch
and :scatter
:
However, labelling the right columns as "Threads" is imprecise/misleading as that implicitly assumes a static, ordered distribution of (sticky) tasks to threads. The image is only really correct when chunks
is used in conjunction with @threads :static
. For @threads :dynamic
, we can't say that the first chunk is going to be processed by the first Julia thread etc. and for @spawn
non of the assumptions are fulfilled. I suggest that you label the left columns as "work items" - "task" already has a different meaning in Julia - and the right columns as "chunks" to avoid the chunk->task->thread mapping story entirely at this early stage. After all, fundamentally, the function chunks
has nothing to do with either tasks or threads. You can than later show this image (or a slightly modified one) for the case where chunks
is used in conjuction with @threads :static
or @threads :dynamic
.
I have more concrete suggestions for the paragraphs, but I think it is easier if I just make a PR for those (will do soon).
In the docs you show a practical example. A minor thing I wanted to point out is that you're missing
using BenchmarkTools
there. Otherwise the example isn't runnable.But what made me open this issue is the following explanatory sentence:
I fail to see how the workload would be more evenly distributed for
nchunks > nthreads()
. Already fornchunks == nthreads()
does every thread have the same workload. Of course, you clearly see better timings fornchunks > nthreads()
. However, I'd argue that this is likely false sharing. Note that increasing the number of chunks can probably mitigate false sharing partially as thes
array that stores the intermediate per-thread sums is bigger and threads are less likely to access the same cache line (i.e. less cache invalidations).To test this theory, I introduced a variant of the parallel sum where I've intentionally padded the
s
array with zeros to ensure that threads don't modify the same cache line. (The cache line size on my system is 64 bytes)These are the timings that I see (I pinned the threads with ThreadPinning just to be safe).
Regular sum:
Padded sum:
In agreement with my expectations, my padded sum isn't just faster then the original implementation but also
nchunks == nthreads()
isn't slower thannchunks > nthreads()
. (One could probably check the decrease in the number of cache invalidations with LinuxPerf.jl or LIKWID.jl)If you agree with the above, you should consider changing your documentation.