Open KnutAM opened 1 year ago
Hi,
one can use the `Channel` object in Julia to store the `ScratchValues` objects from the threaded assembly example. This way, the issues with `nthreads()` and `threadid()` explained in the first footnote of the blog post that you shared would be solved.
What follows is a sketch of the threaded assembly example when one follows the design guidelines from the blog post, i.e. tasks shall not depend on threads. Using channels, the number of `ScratchValues` objects is independent of the number of threads (or tasks), because if the `Channel` is empty, a task will simply wait for it to fill up again.
```julia
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}) where {dim}
    f = zeros(ndofs(dh))
    b = Vec{3}((0.0, 0.0, 0.0)) # body force
    # the number of objects that should fill the channel
    n_obj = 8 # could also be Threads.nthreads(), or whatever
    # create the channel
    scratches = Channel{ScratchValues}(n_obj)
    # fill the channel (assuming the constructor only creates a single object)
    for _ in 1:n_obj
        put!(scratches, create_scratchvalue(K, f, dh))
    end
    for color in colors
        tasks = map(color) do cellid
            Threads.@spawn begin
                # take what you want
                scratch = take!(scratches)
                assemble_cell!(scratch, cellid, K, grid, dh, C, b)
                # but you have to give it back
                put!(scratches, scratch)
            end
        end
        # be patient! you have to let the tasks run to the end, otherwise your
        # timing benchmarks will be marvelous, but you will not get the correct result
        wait.(tasks)
    end
    return K, f
end
```
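The buffer-pool pattern in the sketch above can also be demonstrated in isolation with plain Julia. Everything here (`Buffer`, `process_all`, the dummy workload) is illustrative and merely stands in for `ScratchValues` and `assemble_cell!`; it is not Ferrite API:

```julia
# Minimal sketch of the Channel-as-buffer-pool pattern, detached from Ferrite.
# `Buffer` stands in for ScratchValues; the "work" just writes into a shared
# result vector (one slot per item, so there is no data race on `results`).
struct Buffer
    tmp::Vector{Float64}  # scratch storage, reused across tasks
end

function process_all(items::Vector{Float64}; n_buffers::Int = 4)
    results = zeros(length(items))
    pool = Channel{Buffer}(n_buffers)
    for _ in 1:n_buffers
        put!(pool, Buffer(zeros(8)))
    end
    tasks = map(eachindex(items)) do i
        Threads.@spawn begin
            buf = take!(pool)      # borrow a scratch buffer (waits if pool is empty)
            buf.tmp .= items[i]    # pretend this is assemble_cell!
            results[i] = sum(buf.tmp)
            put!(pool, buf)        # give it back for the next task
        end
    end
    wait.(tasks)
    return results
end
```

Each element of `results` ends up as `8 * items[i]`, since every scratch vector has 8 entries.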
Of course there is always the performance question. Here are some benchmarks from my own implementation. I used 8 threads with a cube of 8^3 linear elements, which translates to 2187 dofs. ~Sadly GitHub screws up the benchmark histograms, so I couldn't post them~. To me the performance looks pretty much equivalent; this will not be a bottleneck for me.
This is the assembly using channels:
```
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  332.687 ms … 432.290 ms  ┊ GC (min … max): 47.88% … 65.26%
 Time  (median):     353.018 ms               ┊ GC (median):    58.12%
 Time  (mean ± σ):   359.041 ms ±  22.796 ms  ┊ GC (mean ± σ):  58.05% ±  3.56%

  ▁    █ ▁██ ▁  ▁█ ▁                                           ▁
  █▁▁▁▁▁▁▁█▁███▁█▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  333 ms          Histogram: frequency by time          432 ms <

 Memory estimate: 437.39 MiB, allocs estimate: 15408784.
```
This is the current assembly:
```
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  168.072 ms … 403.001 ms  ┊ GC (min … max):  0.00% … 64.18%
 Time  (median):     380.939 ms               ┊ GC (median):    61.88%
 Time  (mean ± σ):   367.882 ms ±  58.063 ms  ┊ GC (mean ± σ):  60.23% ± 16.66%

                                                        ▁█▄   ▁
  ▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▆▁█▁▆ ▁
  168 ms          Histogram: frequency by time          403 ms <

 Memory estimate: 436.49 MiB, allocs estimate: 15404044.
```
Disregard this proposal as you please. Kind regards

If one compares the minimum, the performance is notably worse for some reason.
PS: you can post the benchmark as a "code" block that preserves the formatting. You just have to wrap it in three backticks (```) before and after.
Thanks for your suggestion. Using Channels is a nice solution! A problem is that the scaling is not very good (in my benchmarks) for this approach (I believe because each `put!` and `take!` locks, so all other tasks have to wait if there is a queue). One solution (that I use in other places) is another loop, where you loop over a chunk of elements without taking and putting from/to the Channel. But this adds a bit of overhead code, which makes it less readable for users not so familiar with Julia/multithreading.
It is also possible to spawn fewer tasks and let each task keep its own buffer, and then use chunks via the channels approach, see Fredrik's branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl
One option would be to keep the "how-to" as it currently is, using `:static`, but warn that it is not "the best way" and doesn't allow nested threaded tasks. And then provide something like what you have written, but with chunking (perhaps with ChunkSplitters) as an advanced threading example that we can refer to. Feel free to open a PR if you are interested in working on that! (Could also make sense to homogenize the two, to allow most functions to be re-used, for example the `create_scratchvalues` function and all the setup.)
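For illustration, here is a dependency-free sketch of the chunk-loop idea: each task performs a single `take!`/`put!` pair per chunk rather than per cell, so the Channel's lock is contended far less often. All names (`process_chunked`, the dummy workload) are made up for this sketch and are not Ferrite API:

```julia
# Chunked variant of the buffer pool: one take!/put! per chunk of items.
# Each scratch vector stands in for a ScratchValues object.
function process_chunked(items::Vector{Float64}; n_buffers::Int = 2, chunksize::Int = 16)
    results = zeros(length(items))
    pool = Channel{Vector{Float64}}(n_buffers)
    for _ in 1:n_buffers
        put!(pool, zeros(8))  # one scratch vector per pool slot
    end
    tasks = map(Iterators.partition(eachindex(items), chunksize)) do chunk
        Threads.@spawn begin
            buf = take!(pool)     # one take! per chunk, not per cell
            for i in chunk        # serial inner loop over the chunk
                buf .= items[i]   # pretend this is the element routine
                results[i] = sum(buf)
            end
            put!(pool, buf)       # one put! per chunk
        end
    end
    wait.(tasks)
    return results
end
```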
Hi,
so I did a benchmark of the assembly of a box with 8x8x8 linear hexahedral elements. The nice thing is that each color has exactly 64 cells (using the greedy algorithm). I used [1, 2, 4, 8, 16, 32, 64] threads. I do not understand why the time gets worse at 16 threads.
Just want to highlight again that the metric maybe matters here:

> Our results suggest that using the minimum estimator for the true run time of a benchmark, rather than the mean or median, is robust to nonideal statistics and also provides the smallest error. Our model also revealed some behaviors that challenge conventional wisdom: simply running a benchmark for longer, or repeating its execution many times, can render the effects of external variation negligible, even as the error due to timer inaccuracy is amortized.

from https://arxiv.org/pdf/1608.04295.pdf
Which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.
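As a side note, the minimum estimator is easy to apply even without BenchmarkTools: repeat the measurement and keep the smallest time. A minimal base-Julia sketch (the helper name `min_time` is made up for illustration; BenchmarkTools does this much more carefully):

```julia
# Report the minimum over repeated timings, which filters out one-sided
# noise from GC pauses, the OS scheduler, and other external interference.
function min_time(f; samples::Int = 10)
    best = Inf
    for _ in 1:samples
        best = min(best, @elapsed f())
    end
    return best
end
```

Usage would be e.g. `min_time(() -> doassemble(K, colors, grid, dh, C))`.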
> I do not understand why the time gets worse at 16 threads.

Would be interesting to see the results for a larger mesh: as you say, there are only 64 cells in each color, which gives only 4 cells per thread (for 16). I would try 30x30x30, and perhaps just use the regular `@elapsed`, as the running time is long enough.
> Thanks for your suggestion. Using Channels is a nice solution! A problem is that the scaling is not very good (in my benchmarks) for this approach (I believe because each `put!` and `take!` locks, so all other tasks have to wait if there is a queue). One solution (that I use in other places) is another loop, where you loop over a chunk of elements without taking and putting from/to the Channel. But this adds a bit of overhead code, which makes it less readable for users not so familiar with Julia/multithreading.
>
> It is also possible to spawn fewer tasks and let each task keep its own buffer, and then use chunks via the channels approach, see Fredrik's branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl
>
> One option would be to keep the "how-to" as it currently is, using `:static`, but warn that it is not "the best way" and doesn't allow nested threaded tasks. And then provide something like what you have written, but with chunking (perhaps with ChunkSplitters) as an advanced threading example that we can refer to. Feel free to open a PR if you are interested in working on that! (Could also make sense to homogenize the two, to allow most functions to be re-used, for example the `create_scratchvalues` function and all the setup.)
Hi,
the chunk-loop approach sounds interesting and worth considering. Since my use case involves some heavy lifting in each element, a small overhead in the design of the threaded assembly is manageable. I just wanted to implement threaded assembly according to Julia design standards, inspired by this issue of yours.
Now considering the threaded assembly example: for a new user, any form of threading is way better than none, especially if it is simple. Channels get rid of the issues mentioned in the blog post you shared, without introducing a new dependency.
Many greetings
I find this pattern pretty easy to understand too: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl#L127-L142
> Just want to highlight again that the metric maybe matters here:
>
> > Our results suggest that using the minimum estimator for the true run time of a benchmark, rather than the mean or median, is robust to nonideal statistics and also provides the smallest error. Our model also revealed some behaviors that challenge conventional wisdom: simply running a benchmark for longer, or repeating its execution many times, can render the effects of external variation negligible, even as the error due to timer inaccuracy is amortized.
>
> from https://arxiv.org/pdf/1608.04295.pdf
> Which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.

This is the same benchmark as before, but using the minimum runtime instead of the median. This looks much nicer, I have to say.
> > I do not understand why the time gets worse at 16 threads.
>
> Would be interesting to see the results for a larger mesh: as you say, there are only 64 cells in each color, which gives only 4 cells per thread (for 16). I would try 30x30x30, and perhaps just use the regular `@elapsed`, as the running time is long enough.
Here comes the requested benchmark, this time with single-time-measurement statistics. Looks pretty awful.
However, the degree of awfulness in runtime is a moot point, since both versions (using `:static` and using channels, respectively) behave similarly. My proposal for threaded assembly is thus finalized. Decide however you want; I simply felt like reporting a solution, since this thread made me think about my own implementation.
Thanks to everyone partaking in the discussion.
Another thing that came to mind was whether it might be beneficial to use atomics over the coloring.
The threaded assembly in our example is not the recommended way to do things anymore (it is still correct though, AFAIU).
See "PSA: Thread-local state is no longer recommended"
One option could be to use ChunkSplitters.jl.
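A minimal sketch of that recommendation: spawn one task per chunk and let each task allocate its own buffer, so no state is shared between tasks at all. Here `Iterators.partition` stands in for ChunkSplitters' chunking iterator to keep the sketch dependency-free, and all names (`process_task_local`, the dummy workload) are illustrative, not Ferrite API:

```julia
# One task per chunk; each task owns a freshly allocated scratch buffer,
# so nothing depends on threadid() and nothing is shared between tasks.
function process_task_local(items::Vector{Float64}; nchunks::Int = 4)
    results = zeros(length(items))
    chunksize = cld(length(items), nchunks)  # ceil-divide so every item is covered
    tasks = map(Iterators.partition(eachindex(items), chunksize)) do chunk
        Threads.@spawn begin
            buf = zeros(8)        # task-local scratch, never handed to another task
            for i in chunk
                buf .= items[i]   # pretend this is the element routine
                results[i] = sum(buf)
            end
        end
    end
    wait.(tasks)
    return results
end
```

With a real ChunkSplitters dependency, the `Iterators.partition(...)` call would be replaced by the package's chunking iterator; the task body stays the same.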