Open KnutAM opened 1 year ago
Hi,
one can use the `Channel` object in Julia to store the `ScratchValues` objects from the threaded assembly example. This way, the issues with `nthreads()` and `threadid()` explained in the first footnote of the blog post that you shared would be solved.
What follows is a sketch of the threaded assembly example when one follows the design guidelines from the blog post, i.e. tasks shall not depend on threads. Using channels, the number of `ScratchValues` objects is independent of the number of threads (or tasks), because if the `Channel` is empty, a task will simply wait for it to fill up again.
```julia
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}) where {dim}
    f = zeros(ndofs(dh))
    b = Vec{3}((0.0, 0.0, 0.0)) # body force
    # the number of objects that should fill the channel
    n_obj = 8 # could also be Threads.nthreads(), or whatever
    # create the channel
    scratches = Channel{ScratchValues}(n_obj)
    # fill the channel (assuming the constructor only creates a single object)
    for _ in 1:n_obj
        put!(scratches, create_scratchvalue(K, f, dh))
    end
    for color in colors
        tasks = map(color) do cellid
            Threads.@spawn begin
                # take what you want
                scratch = take!(scratches)
                assemble_cell!(scratch, cellid, K, grid, dh, C, b)
                # but you have to give it back
                put!(scratches, scratch)
            end
        end
        # be patient! you have to let the tasks run to the end, otherwise your
        # timing benchmarks will be marvelous, but you will not get the correct result
        wait.(tasks)
    end
    return K, f
end
```
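The buffer-pool pattern in the sketch above can also be demonstrated in isolation with plain Julia. Everything here (`Buffer`, `process_all`, the dummy workload) is illustrative and merely stands in for `ScratchValues` and `assemble_cell!`; it is not Ferrite API:

```julia
# Minimal sketch of the Channel-as-buffer-pool pattern, detached from Ferrite.
# `Buffer` stands in for ScratchValues; the "work" just writes into a shared
# result vector (one slot per item, so there is no data race on `results`).
struct Buffer
    tmp::Vector{Float64}  # scratch storage, reused across tasks
end

function process_all(items::Vector{Float64}; n_buffers::Int = 4)
    results = zeros(length(items))
    pool = Channel{Buffer}(n_buffers)
    for _ in 1:n_buffers
        put!(pool, Buffer(zeros(8)))
    end
    tasks = map(eachindex(items)) do i
        Threads.@spawn begin
            buf = take!(pool)      # borrow a scratch buffer (waits if pool is empty)
            buf.tmp .= items[i]    # pretend this is assemble_cell!
            results[i] = sum(buf.tmp)
            put!(pool, buf)        # give it back for the next task
        end
    end
    wait.(tasks)
    return results
end
```

Each element of `results` ends up as `8 * items[i]`, since every scratch vector has 8 entries.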
Of course there is always the performance question. Here are some benchmarks from my own implementation. I used 8 threads with a cube of 8^3 linear elements, which translates to 2187 dofs. ~Sadly GitHub screws up the benchmark histograms, so I couldn't post them~. To me the performance looks pretty much equivalent; this will not be a bottleneck for me.
This is the assembly using channels:
```
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  332.687 ms … 432.290 ms  ┊ GC (min … max): 47.88% … 65.26%
 Time  (median):     353.018 ms               ┊ GC (median):    58.12%
 Time  (mean ± σ):   359.041 ms ±  22.796 ms  ┊ GC (mean ± σ):  58.05% ±  3.56%

  ▁    █ ▁██ ▁  ▁█ ▁                                           ▁
  █▁▁▁▁▁▁▁█▁███▁█▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  333 ms          Histogram: frequency by time          432 ms <

 Memory estimate: 437.39 MiB, allocs estimate: 15408784.
```
This is the current assembly:
```
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  168.072 ms … 403.001 ms  ┊ GC (min … max):  0.00% … 64.18%
 Time  (median):     380.939 ms               ┊ GC (median):    61.88%
 Time  (mean ± σ):   367.882 ms ±  58.063 ms  ┊ GC (mean ± σ):  60.23% ± 16.66%

                                                        ▁█▄   ▁
  ▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▆▁█▁▆ ▁
  168 ms          Histogram: frequency by time          403 ms <

 Memory estimate: 436.49 MiB, allocs estimate: 15404044.
```
Disregard this proposal as you please. Kind regards

If one compares the minimum, the performance is notably worse for some reason.
PS: you can post the benchmark as a "code" block that preserves the formatting. You just have to wrap it in three backticks (```) before and after.
Thanks for your suggestion. Using Channels is a nice solution! A problem is that the scaling is not very good (in my benchmarks) for this approach (I believe because each `put!` and `take!` locks, so all other tasks have to wait if there is a queue). One solution (that I use in other places) is another loop, where you loop over a chunk of elements without taking and putting from/to the Channel. But this adds a bit of overhead code, which makes it less readable for users not so familiar with Julia/multithreading.
It is also possible to spawn fewer tasks and let each task keep its own buffer, and then use chunks via the channels approach, see Fredrik's branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl
One option would be to keep the "how-to" as it currently is, using `:static`, but warn that it is not "the best way" and doesn't allow nested threaded tasks. And then provide something like what you have written, but with chunking (perhaps with ChunkSplitters) as an advanced threading example that we can refer to. Feel free to open a PR if you are interested in working on that! (Could also make sense to homogenize the two, to allow most functions to be re-used, for example the `create_scratchvalues` function and all the setup.)
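For illustration, here is a dependency-free sketch of the chunk-loop idea: each task performs a single `take!`/`put!` pair per chunk rather than per cell, so the Channel's lock is contended far less often. All names (`process_chunked`, the dummy workload) are made up for this sketch and are not Ferrite API:

```julia
# Chunked variant of the buffer pool: one take!/put! per chunk of items.
# Each scratch vector stands in for a ScratchValues object.
function process_chunked(items::Vector{Float64}; n_buffers::Int = 2, chunksize::Int = 16)
    results = zeros(length(items))
    pool = Channel{Vector{Float64}}(n_buffers)
    for _ in 1:n_buffers
        put!(pool, zeros(8))  # one scratch vector per pool slot
    end
    tasks = map(Iterators.partition(eachindex(items), chunksize)) do chunk
        Threads.@spawn begin
            buf = take!(pool)     # one take! per chunk, not per cell
            for i in chunk        # serial inner loop over the chunk
                buf .= items[i]   # pretend this is the element routine
                results[i] = sum(buf)
            end
            put!(pool, buf)       # one put! per chunk
        end
    end
    wait.(tasks)
    return results
end
```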
Hi,
so I did a benchmark of the assembly of a box with 8x8x8 linear hexahedral elements. The nice thing is that each color has exactly 64 cells (using the greedy algorithm). I used [1, 2, 4, 8, 16, 32, 64] threads. I do not understand why the time gets worse at 16 threads.
Just want to highlight again that the metric maybe matters here:

> Our results suggest that using the minimum estimator for the true run time of a benchmark, rather than the mean or median, is robust to nonideal statistics and also provides the smallest error. Our model also revealed some behaviors that challenge conventional wisdom: simply running a benchmark for longer, or repeating its execution many times, can render the effects of external variation negligible, even as the error due to timer inaccuracy is amortized.

from https://arxiv.org/pdf/1608.04295.pdf
Which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.
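As a side note, the minimum estimator is easy to apply even without BenchmarkTools: repeat the measurement and keep the smallest time. A minimal base-Julia sketch (the helper name `min_time` is made up for illustration; BenchmarkTools does this much more carefully):

```julia
# Report the minimum over repeated timings, which filters out one-sided
# noise from GC pauses, the OS scheduler, and other external interference.
function min_time(f; samples::Int = 10)
    best = Inf
    for _ in 1:samples
        best = min(best, @elapsed f())
    end
    return best
end
```

Usage would be e.g. `min_time(() -> doassemble(K, colors, grid, dh, C))`.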
> I do not understand why the time gets worse at 16 threads.

Would be interesting to see the results for a larger mesh: as you say, there are only 64 cells in each color, which gives only 4 cells per thread (for 16). I would try 30x30x30, and perhaps just use the regular `@elapsed`, as the running time is long enough.
> Thanks for your suggestion. Using Channels is a nice solution! A problem is that the scaling is not very good (in my benchmarks) for this approach (I believe because each `put!` and `take!` locks, so all other tasks have to wait if there is a queue). One solution (that I use in other places) is another loop, where you loop over a chunk of elements without taking and putting from/to the Channel. But this adds a bit of overhead code, which makes it less readable for users not so familiar with Julia/multithreading.
>
> It is also possible to spawn fewer tasks and let each task keep its own buffer, and then use chunks via the channels approach, see Fredrik's branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl
>
> One option would be to keep the "how-to" as it currently is, using `:static`, but warn that it is not "the best way" and doesn't allow nested threaded tasks. And then provide something like what you have written, but with chunking (perhaps with ChunkSplitters) as an advanced threading example that we can refer to. Feel free to open a PR if you are interested in working on that! (Could also make sense to homogenize the two, to allow most functions to be re-used, for example the `create_scratchvalues` function and all the setup.)
Hi,
the chunk-loop approach sounds interesting and worth considering. Since my use case involves some heavy lifting in each element, a small overhead in the design of the threaded assembly is manageable. I just wanted to implement threaded assembly according to Julia design standards, inspired by this issue of yours.
Now considering the threaded assembly example: for a new user, any form of threading is way better than none, especially if it is simple. Channels get rid of the issues mentioned in the blog post you shared, without introducing a new dependency.
Many greetings
I find this pattern pretty easy to understand too: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl#L127-L142
> Just want to highlight again that the metric maybe matters here:
>
> > Our results suggest that using the minimum estimator for the true run time of a benchmark, rather than the mean or median, is robust to nonideal statistics and also provides the smallest error. Our model also revealed some behaviors that challenge conventional wisdom: simply running a benchmark for longer, or repeating its execution many times, can render the effects of external variation negligible, even as the error due to timer inaccuracy is amortized.
>
> from https://arxiv.org/pdf/1608.04295.pdf
> Which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.

This is the same benchmark as before, but using the minimum runtime instead of the median. This looks much nicer, I have to say.
> > I do not understand why the time gets worse at 16 threads.
>
> Would be interesting to see the results for a larger mesh: as you say, there are only 64 cells in each color, which gives only 4 cells per thread (for 16). I would try 30x30x30, and perhaps just use the regular `@elapsed`, as the running time is long enough.
Here comes the requested benchmark, this time with single-time-measurement statistics. Looks pretty awful.
However, the degree of awfulness in runtime is a moot point, since both versions (using `:static` and using channels, respectively) behave similarly. My proposal for threaded assembly is thus finalized. Decide however you want; I simply felt like reporting a solution, since this thread made me think about my own implementation.
Thanks to everyone partaking in the discussion.
Another thing that came to mind was whether it might be beneficial to use atomics over the coloring.
The threaded assembly in our example is not the recommended way to do things anymore (it is still correct though, AFAIU).
See "PSA: Thread-local state is no longer recommended"
One option could be to use ChunkSplitters.jl.
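A minimal sketch of that recommendation: spawn one task per chunk and let each task allocate its own buffer, so no state is shared between tasks at all. Here `Iterators.partition` stands in for ChunkSplitters' chunking iterator to keep the sketch dependency-free, and all names (`process_task_local`, the dummy workload) are illustrative, not Ferrite API:

```julia
# One task per chunk; each task owns a freshly allocated scratch buffer,
# so nothing depends on threadid() and nothing is shared between tasks.
function process_task_local(items::Vector{Float64}; nchunks::Int = 4)
    results = zeros(length(items))
    chunksize = cld(length(items), nchunks)  # ceil-divide so every item is covered
    tasks = map(Iterators.partition(eachindex(items), chunksize)) do chunk
        Threads.@spawn begin
            buf = zeros(8)        # task-local scratch, never handed to another task
            for i in chunk
                buf .= items[i]   # pretend this is the element routine
                results[i] = sum(buf)
            end
        end
    end
    wait.(tasks)
    return results
end
```

With a real ChunkSplitters dependency, the `Iterators.partition(...)` call would be replaced by the package's chunking iterator; the task body stays the same.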