JuliaFolds2 / ChunkSplitters.jl

Simple chunk splitters for parallel loop executions
MIT License

Improve/correct practical example section in docs #8

Closed carstenbauer closed 1 year ago

carstenbauer commented 1 year ago

In the docs you show a practical example. A minor thing I wanted to point out is that you're missing using BenchmarkTools there. Otherwise the example isn't runnable.
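Concretely, the example's preamble would need something like this (ChunkSplitters itself is presumably already loaded there):

using BenchmarkTools  # provides the @btime macro used in the example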

But what made me open this issue is the following explanatory sentence:

Note that it is possible that nchunks > nthreads() is optimal, since that will distribute the workload more evenly among available threads.

I fail to see how the workload would be more evenly distributed for nchunks > nthreads(). Already for nchunks == nthreads(), every thread has the same workload. Of course, you clearly see better timings for nchunks > nthreads(). However, I'd argue that this is likely false sharing. Note that increasing the number of chunks can probably mitigate false sharing partially, as the s array that stores the intermediate per-chunk sums becomes bigger and threads are less likely to access the same cache line (i.e., fewer cache invalidations).

To test this theory, I introduced a variant of the parallel sum where I've intentionally padded the s array with zeros to ensure that threads don't modify the same cache line. (The cache line size on my system is 64 bytes)

# assumes `using ChunkSplitters` (chunks(x, nchunks) yields (range, chunk index) pairs)
function sum_parallel_padded(f, x; nchunks=Threads.nthreads())
    # pad each chunk's slot by 100 elements (800 bytes for Float64) so that
    # no two chunks write to the same 64-byte cache line
    s = fill(zero(eltype(x)), nchunks * 100)
    Threads.@threads for (xrange, ichunk) in chunks(x, nchunks)
        for i in xrange
            s[(ichunk - 1) * 100 + 1] += f(x[i])
        end
    end
    return sum(s)
end
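(As an aside, the factor of 100 is far more padding than strictly needed: 100 Float64s occupy 800 bytes, while one 64-byte cache line holds only 8. A minimal sketch of the stride computation, purely illustrative and not from the original code:)

using Base.Threads

cacheline = 64                            # bytes, the cache line size stated above
stride = cld(cacheline, sizeof(Float64))  # 8 Float64s fill one 64-byte cache line
nchunks = nthreads()
s = fill(0.0, nchunks * stride)           # chunk i accumulates into s[(i-1)*stride + 1]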

These are the timings that I see (I pinned the threads with ThreadPinning just to be safe).

Regular sum:

julia> @btime sum(x -> log(x)^7, $x)
  127.574 ms (0 allocations: 0 bytes)
-5.0711112944955185e10

julia> @btime sum_parallel(x -> log(x)^7, $x; nchunks=4)
  97.924 ms (69 allocations: 6.94 KiB)
-5.07111129449084e10

julia> @btime sum_parallel(x -> log(x)^7, $x; nchunks=12)
  96.199 ms (74 allocations: 7.16 KiB)
-5.0711112944942474e10

julia> @btime sum_parallel(x -> log(x)^7, $x; nchunks=64)
  21.434 ms (74 allocations: 7.56 KiB)
-5.071111294495303e10

Padded sum:

julia> @btime sum_parallel_padded(x -> log(x)^7, $x; nchunks=4)
  34.380 ms (69 allocations: 7.23 KiB)
-5.0711112944908394e10

julia> @btime sum_parallel_padded(x -> log(x)^7, $x; nchunks=12)
  13.571 ms (74 allocations: 8.06 KiB)
-5.071111294494246e10

julia> @btime sum_parallel_padded(x -> log(x)^7, $x; nchunks=64)
  15.276 ms (75 allocations: 12.16 KiB)
-5.071111294495302e10

In agreement with my expectations, the padded sum isn't just faster than the original implementation; with it, nchunks == nthreads() is also no slower than nchunks > nthreads(). (One could probably check the decrease in the number of cache invalidations with LinuxPerf.jl or LIKWID.jl.)
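For instance, a hedged sketch of such a check with LinuxPerf.jl (Linux only; the event names are standard perf events whose availability depends on the CPU, and x, sum_parallel, sum_parallel_padded are as defined above):

using LinuxPerf  # provides the @pstats macro

# Compare hardware cache traffic of the two variants; if false sharing is
# the culprit, the padded version should show noticeably fewer cache misses.
@pstats "cache-references,cache-misses" sum_parallel(v -> log(v)^7, x; nchunks=12)
@pstats "cache-references,cache-misses" sum_parallel_padded(v -> log(v)^7, x; nchunks=12)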

If you agree with the above, you should consider changing your documentation.

lmiq commented 1 year ago

I fail to see how the workload would be more evenly distributed for nchunks > nthreads(). Already for nchunks == nthreads(), every thread has the same workload.

First, you clearly demonstrate that the example there is flawed relative to what I wanted to show. So it needs to be fixed, of course. Thank you for pointing that out.

What I had in mind when building that example is that the workload of each chunk may be unevenly distributed by the nature of the workload itself (not by anything related to the threading model).

I tried to construct an example that shows what I wanted with @threads, but it ended up being way too artificial. You are correct: the reasoning I was using is valid only for @spawn/@sync, where spawning a greater number of tasks means the scheduler can keep otherwise idle threads busy.

I will be fixing the documentation asap. Thanks.

lmiq commented 1 year ago

Let me see if I get this right now.

I have these functions, which receive a very artificially unbalanced workload, defined by an input workload vector with one entry per task:

julia> function uneven_workload_threads(x, workload; nchunks=nthreads(), chunk_type=:batch)
           s = fill(zero(eltype(x)), nchunks)
           @threads for (xrange, ichunk) in chunks(workload, nchunks, chunk_type)
               for i in xrange
                   s[ichunk] += sum(log(x[j])^7 for j in 1:workload[i])
               end
           end
           return sum(s)
       end
uneven_workload_threads (generic function with 1 method)

julia> function uneven_workload_spawn(x, workload; nchunks=nthreads(), chunk_type=:batch)
           s = fill(zero(eltype(x)), nchunks)
           @sync for (xrange, ichunk) in chunks(workload, nchunks, chunk_type)
               @spawn for i in xrange
                   s[ichunk] += sum(log(x[j])^7 for j in 1:workload[i])
               end
           end
           return sum(s)
       end
uneven_workload_spawn (generic function with 1 method)

We create a very unbalanced workload, with:

julia> x = rand(10^6); n_workload = collect(i <= 8 ? 10^4 : 1 for i in 1:64);

Then, using nchunks = nthreads(), we get:

julia> @btime uneven_workload_threads($x, $n_workload; nchunks=8, chunk_type=:batch)
  3.624 ms (47 allocations: 4.70 KiB)
-3.096516397309112e8

julia> @btime uneven_workload_threads($x, $n_workload; nchunks=8, chunk_type=:scatter)
  729.888 μs (47 allocations: 4.70 KiB)
-3.096516397309112e8

julia> @btime uneven_workload_spawn($x, $n_workload; nchunks=8, chunk_type=:batch)
  3.597 ms (93 allocations: 6.27 KiB)
-3.096516397309112e8

julia> @btime uneven_workload_spawn($x, $n_workload; nchunks=8, chunk_type=:scatter)
  736.788 μs (96 allocations: 6.61 KiB)
-3.096516397309112e8

So it is possible to deal with the unbalanced workload with the :scatter option, since here there is a correlation between chunk index and workload. However, if that correlation is not known, one can deal with the workload by increasing the number of chunks, if using the @sync/@spawn option:

julia> @btime uneven_workload_spawn($x, $n_workload; nchunks=64, chunk_type=:batch)
  845.804 μs (659 allocations: 47.98 KiB)
-3.096516397309112e8

But the same does not work if using @threads:

julia> @btime uneven_workload_threads($x, $n_workload; nchunks=64, chunk_type=:batch)
  3.619 ms (46 allocations: 5.11 KiB)
-3.096516397309112e8
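To make explicit why :scatter helps here, compare the index ranges the two options produce (a small sketch using the v1-era chunks API from this thread; the exact range types shown in the comments are illustrative):

using ChunkSplitters

# :batch keeps contiguous ranges, so the heavy first work items all land
# in the first chunks; :scatter interleaves indices across chunks.
for (r, i) in chunks(1:8, 4, :batch)
    @show i, r   # (1, 1:2), (2, 3:4), (3, 5:6), (4, 7:8)
end
for (r, i) in chunks(1:8, 4, :scatter)
    @show i, r   # (1, 1:4:5), (2, 2:4:6), (3, 3:4:7), (4, 4:4:8)
end

With 64 chunks and @threads, the chunk iterator is still split contiguously among the 8 threads, so the 8 heavy chunks end up on one thread, which would explain the unchanged timing above.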

Do you agree with this? I think I will replace the example docs with these examples.

carstenbauer commented 1 year ago

I'll take a closer look soon, but I think that a balanced workload (like the one in the original example currently in the docs) is also worth mentioning.

lmiq commented 1 year ago

Here is what the docs look like now. I kept the other example from before, and just removed everything associated with varying the number of chunks there. And I added this:

Load balancing considerations

Here we define two functions which (artificially) produce very uneven workload distributions among tasks. Basically, we sum log(x[i])^7 over the elements x[i] of an array. However, each task has to sum a different number of elements, defined in a workload vector. The workload vector has 64 entries, one per task: the first 8 tasks perform the sum over 10^4 elements of the x array each, and the remaining tasks sum only a single element.

The functions are defined using the Threads.@threads or @sync/Threads.@spawn macros from Base Julia, which allow for different load-balancing behavior.

The functions are:

julia> using Base.Threads, ChunkSplitters

julia> function uneven_workload_threads(x, work_load; nchunks::Int, chunk_type::Symbol)
           s = fill(zero(eltype(x)), nchunks)
           @threads for (xrange, ichunk) in chunks(work_load, nchunks, chunk_type)
               for i in xrange
                   s[ichunk] += sum(log(x[j])^7 for j in 1:work_load[i]) 
               end
           end
           return sum(s)
       end

julia> function uneven_workload_spawn(x, work_load; nchunks::Int, chunk_type::Symbol)
           s = fill(zero(eltype(x)), nchunks)
           @sync for (xrange, ichunk) in chunks(work_load, nchunks, chunk_type)
               @spawn for i in xrange
                   s[ichunk] += sum(log(x[j])^7 for j in 1:work_load[i]) 
               end
           end
           return sum(s)
       end

We create a very unbalanced workload, with:

julia> x = rand(10^6); work_load = collect(i <= 8 ? 10^4 : 1 for i in 1:64);

julia> using UnicodePlots

julia> lineplot(work_load; xlabel="task", ylabel="workload")
                   ┌────────────────────────────────────────┐ 
            10 000 │⠈⠉⠉⠉⢹⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
   workload        │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠈⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                 0 │⠀⠀⠀⠀⠀⣇⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⠀⠀⠀│ 
                   └────────────────────────────────────────┘ 
                   ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀70⠀ 
                   ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀task⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 

Using nchunks == nthreads(), which is 8 in this case, we get the following timings:

julia> @btime uneven_workload_threads($x, $work_load; nchunks=8, chunk_type=:batch)
  3.671 ms (43 allocations: 4.52 KiB)
-1.121686418772953e9

julia> @btime uneven_workload_threads($x, $work_load; nchunks=8, chunk_type=:scatter)
  728.946 μs (47 allocations: 4.64 KiB)
-1.121686418772953e9

julia> @btime uneven_workload_spawn($x, $work_load; nchunks=8, chunk_type=:batch)
  3.604 ms (59 allocations: 5.08 KiB)
-1.121686418772953e9

julia> @btime uneven_workload_spawn($x, $work_load; nchunks=8, chunk_type=:scatter)
  737.583 μs (62 allocations: 5.17 KiB)
-1.121686418772953e9

Therefore, it is possible to deal with the unbalanced workload with the :scatter option, since here there is a correlation between chunk index and workload. However, if that correlation is not known, one can deal with the workload by increasing the number of chunks, if using the @sync/@spawn option:

julia> @btime uneven_workload_spawn($x, $work_load; nchunks=64, chunk_type=:batch)
  820.537 μs (397 allocations: 38.80 KiB)
-1.1216864187729511e9

But the same does not work when using @threads, because the heavy first 8 chunks will nonetheless be assigned to the same thread:

julia> @btime uneven_workload_threads($x, $work_load; nchunks=64, chunk_type=:batch)
  3.689 ms (43 allocations: 4.95 KiB)
-1.1216864187729511e9

carstenbauer commented 1 year ago

Frankly, I must say that you've made the discussion somewhat more complicated by now mixing several things at once (the uneven workload, :batch vs. :scatter, and @threads vs. @spawn).

I'll try to work my way through all of this, but it will take more time now.

(Also note that I would have chosen a continuously decreasing/increasing workload instead of a pretty special/biased step function that has its step precisely at the number of threads.)

carstenbauer commented 1 year ago

General comment up-front: I assume that false sharing is less of an issue in the following because threads are writing to s at different times (due to the uneven workload). So essentially we are now talking about effects not mentioned in the OP.

I chose a linearly increasing workload, which makes understanding the results pretty straightforward.

work_load = floor.(Int, range(start=1, stop=10^4, length=64))

                   ┌────────────────────────────────────────┐ 
            10 000 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠉⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
   workload        │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⠀⠀⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⠀⠀⢀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   │⠀⠀⠀⡠⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                 0 │⢀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
                   └────────────────────────────────────────┘ 
                   ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀70⠀ 
                   ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀task⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 

Findings

TLDR: In combination with @threads, :scatter can be a "poor man's load-balancing". But @spawn gives slightly better and more general load-balancing (if one gives it enough chunks to balance...). Increasing the number of chunks only helps with @spawn and only really with @spawn + :batch. Again, one needs to give it enough chunks to balance... It does not help with @threads because that doesn't load-balance anyway.

lmiq commented 1 year ago

I have updated following your previous comments, and released the new docs:

https://m3g.github.io/ChunkSplitters.jl/v1.0/

Now I'll take a look at these new observations. If you want to contribute something directly there, you're welcome, of course.

edit: I agree with all your observations above, and I think the docs now more or less reflect them. I just don't mention the tuning of nchunks for @threads + :scatter that you suggest above; maybe that's an unnecessary complication.

carstenbauer commented 1 year ago

Frankly, I'm running out of steam and, while I don't want to sound harsh, I don't feel that my comments are being valued properly. It took me quite some effort to write down the study above, and you already updated the docs without waiting for my (announced) comments. Also, you immediately and completely deviated from the point that I made in the OP (which was false sharing). (No offense, just trying to communicate my impression here.)

In any case, I think the paragraphs in the new docs aren't particularly clear and can still be improved. I would also add a sentence along the following lines to the (original) even-workload example with @threads:

"Note that increasing the number of chunks beyond nthreads() gives better performance for the above example. However, this is due to more subtle effects (false-sharing) and not related to the chunking and the distribution of work among threads. For well-designed parallel algorithms, nchunks == nthreads() should be optimal in conjuction with @threads."

But apart from this, I will step back from this now (and maybe come back on a different day).

lmiq commented 1 year ago

Well, I'm sorry you got that impression. Since your comment I have been running experiments to address the issue, and I wrote that new version of the docs before you posted that detailed account of your observations, because I wanted to remove the incorrect benchmark that was there as soon as possible.

At the same time I was feeding my 2-month-old baby and taking care of my 4-year-old kid while my dog is dying. So, please, don't take it personally.

Concerning your points:

1) False sharing. I completely bought your argument, and that's why I searched for a different example where it would be less of an issue. My goal has been to provide an example where :scatter and :batch, and possibly nchunks > nthreads(), could be useful for better load balancing. I didn't get the impression that I should keep the example where false sharing can be an issue, so I removed the benchmarks with nchunks > nthreads() in that case.

2) Your note about false sharing, suggested here, is nice; I understand it as a final note for the discussion on load balancing. I'll add it now.

You brought up two important issues here: the example in which I was trying to demonstrate the use of nchunks > nthreads() was conceptually wrong; and increasing nchunks is of no use when the parallelization is done with @threads. The observations concerning the workload distribution with :scatter and :batch are what I expect (that's why those options exist). It is not clear to me whether their utility is unclear to begin with, and whether you are suggesting that a detailed explanation of the effects, as you provided, would be useful for the docs.

Anyway, I'm open to suggestions, and if you ever feel like contributing, you'll of course be most welcome.

carstenbauer commented 1 year ago

At the same time I was feeding my 2-month-old baby and taking care of my 4-year-old kid while my dog is dying. So, please, don't take it personally.

No worries, I'm not taking it personally. Congratulations on your newborn and condolences for your dog. I wish you all the best!

  1. False sharing. I completely bought your argument, and that's why I searched for a different example where it would be less of an issue. My goal has been to provide an example where :scatter and :batch, and possibly nchunks > nthreads(), could be useful for better load balancing. I didn't get the impression that I should keep the example where false sharing can be an issue, so I removed the benchmarks with nchunks > nthreads() in that case.

I don't think there needs to be a false sharing example (although it is a rather common performance mistake), but I think there should be an example with an even workload, because it is 1) a common case and 2) rather easy to reason about.

Under 2.: It is not clear to me whether their utility is unclear to begin with, and whether you are suggesting that a detailed explanation of the effects, as you provided, would be useful for the docs.

The detailed exploration is probably too much for the docs (or, if presented, it should be "optional", e.g. behind an (un)foldable element or similar). But I think that something along the lines of my "TLDR" above would benefit the docs. Currently, the overall message about what :scatter and nchunks > nthreads() do is a little bit "hidden" by the multiple paragraphs and multiple examples/benchmarks.

On a more general note, you are conflating ideas and terminology a little bit, which makes it harder for the reader to understand what's going on. For example, right at the start you show this image while talking about :batch and :scatter:

[image: docs figure illustrating :batch and :scatter chunking, with the right-hand columns labelled "Threads"]

However, labelling the right columns as "Threads" is imprecise/misleading, as it implicitly assumes a static, ordered distribution of (sticky) tasks to threads. The image is only really correct when chunks is used in conjunction with @threads :static. For @threads :dynamic, we can't say that the first chunk is going to be processed by the first Julia thread etc., and for @spawn none of the assumptions are fulfilled. I suggest that you label the left columns as "work items" - "task" already has a different meaning in Julia - and the right columns as "chunks", to avoid the chunk->task->thread mapping story entirely at this early stage. After all, fundamentally, the function chunks has nothing to do with either tasks or threads. You can then later show this image (or a slightly modified one) for the case where chunks is used in conjunction with @threads :static or @threads :dynamic.
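A quick way to see that the chunk-to-thread mapping is not fixed (a sketch, using the two-argument chunks form from earlier in this thread):

using Base.Threads, ChunkSplitters

# With the default @threads scheduling there is no guarantee that chunk 1
# runs on thread 1, etc.; the printed mapping can differ between runs.
@threads for (xrange, ichunk) in chunks(1:64, 8)
    println("chunk $ichunk -> thread $(threadid())")
end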

I have more concrete suggestions for the paragraphs, but I think it is easier if I just make a PR for those (will do soon).