termi-official closed this issue 8 months ago
@termi-official and @KnutAM: Is there enough novelty in your investigations of parallel matrix assembly for a paper? If so, would you be interested in writing something? Please let me know your thoughts at pkrysl@ucsd.edu. I look forward to it. P
The short answer is: no, there is not even incremental research happening right now.
We are merely trying to reproduce a subset of the results from the WorkStream paper and investigate bottlenecks in our implementation (which underperformed, for reasons I could not fully explain when I opened this thread). Nothing novel is happening here that is not already described in the literature on multithreaded assembly. And I think that with the recent work from CEED, the most important scalability problems for multithreaded assembly are solved anyway (which I am currently trying to reproduce). But thanks for the offer @PetrKryslUCSD, I appreciate it!
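For readers unfamiliar with the technique under discussion: the WorkStream approach avoids write conflicts during assembly by coloring the mesh so that no two cells of the same color share degrees of freedom; colors are processed sequentially, cells within a color in parallel. Below is a minimal, self-contained sketch of that idea on a 1D toy problem (the toy mesh, the helper names, and the odd/even coloring are illustrative assumptions, not Ferrite's actual implementation):

```julia
using SparseArrays, Base.Threads

# Build the sparsity pattern up front so that threads only update values of
# already-stored entries (no concurrent mutation of the matrix structure).
function create_pattern(nel::Int)
    ndofs = nel + 1
    rows = Int[]; cols = Int[]
    for e in 1:nel, i in 0:1, j in 0:1
        push!(rows, e + i); push!(cols, e + j)
    end
    return sparse(rows, cols, zeros(length(rows)), ndofs, ndofs)
end

function assemble_colored!(K::SparseMatrixCSC, nel::Int, h::Float64)
    ke = (1 / h) * [1.0 -1.0; -1.0 1.0]   # local stiffness of a 1D Laplacian element
    colors = (1:2:nel, 2:2:nel)           # odd/even elements never share a node
    for color in colors                   # colors run sequentially ...
        @threads for e in color           # ... cells within a color run in parallel
            dofs = (e, e + 1)
            for i in 1:2, j in 1:2
                K[dofs[i], dofs[j]] += ke[i, j]   # race-free: entries are disjoint
            end
        end
    end
    return K
end

K = assemble_colored!(create_pattern(10), 10, 0.1)
```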
Understood. Thanks.
Could you by any chance point to the work from CEED? Thanks.
I think a good start is https://doi.org/10.1016/j.parco.2021.102841 . A more exhaustive list should be available at https://ceed.exascaleproject.org/pubs/ .
An additional data point: FinEtools assembly only, 64-core Opteron machine with 1, 2, 4, 8, 16, 64 threads:
julia> 64.96466 ./ [35.713732, 15.687828, 9.211306, 4.647433, 2.38525, 1.358766]
6-element Vector{Float64}:
1.81904e+00
4.14109e+00
7.05271e+00
1.39786e+01
2.72360e+01
4.78115e+01
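Taken at face value, the corresponding parallel efficiencies would be as follows (a sketch: the one-to-one mapping of the six timings onto the listed thread counts, and treating 64.96466 s as the serial reference, are assumptions the post leaves ambiguous):

```julia
# Speedups from the timings above, then efficiency = speedup / thread count.
speedup = 64.96466 ./ [35.713732, 15.687828, 9.211306, 4.647433, 2.38525, 1.358766]
efficiency = speedup ./ [1, 2, 4, 8, 16, 64]
```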
> And I think that with the recent work from CEED, the most important scalability problems for multithreaded assembly are solved anyway (which I am currently trying to reproduce).
I must be missing something. The paper you linked does not talk about threading (unless you count GPU computing as threading). Did you have in mind a different paper?
@termi-official Ping...
GPU parallelism is basically thread parallelism. The paper gives an overview with quite a few references where you can dive deeper. Also see, e.g., Figs. 7 and 8 for some benchmarks where throughput is measured, which can serve as a proxy for scalability.
I think their solution is really not to build a matrix at all. So, good, but not a silver bullet...
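To make that point concrete, here is a matrix-free sketch on the same 1D toy problem as before (again an illustrative assumption, not CEED's or Ferrite's code): the operator is applied element by element, so no global matrix is ever assembled or stored.

```julia
# Matrix-free operator application: compute y = K * x without forming K.
# `ke` is the local element stiffness and is the only matrix that ever exists.
function matfree_mul!(y::Vector{Float64}, x::Vector{Float64}, nel::Int, h::Float64)
    ke = (1 / h) * [1.0 -1.0; -1.0 1.0]
    fill!(y, 0.0)
    for e in 1:nel
        dofs = (e, e + 1)                         # gather local dofs
        for i in 1:2, j in 1:2
            y[dofs[i]] += ke[i, j] * x[dofs[j]]   # apply and scatter
        end
    end
    return y
end

nel, h = 10, 0.1
y = matfree_mul!(zeros(nel + 1), rand(nel + 1), nel, h)
```

This is also why the CEED benchmarks report throughput (operator applications per second) rather than assembled-matrix timings.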
Currently, threaded assembly does not scale beyond 3 cores on any machine I have tried, and I cannot figure out why. For the measurements I modified threaded_assembly.jl to also utilize LinuxPerf.jl.
Here are some measurements on a machine with 16 cores (32 threads):
Eliminating the calls to assemble!, reinit!, and shape_* does not increase scalability. Increasing the workload by replacing the linear problem with the assembly from the hyperelastic (i.e. Neo-Hookean) example does not significantly increase scalability either. Happy for any suggestions on possible points of failure.
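For context, LinuxPerf.jl exposes hardware performance counters through its @pstats macro; a minimal sketch of the kind of instrumentation mentioned above (the event list and the stand-in workload are illustrative assumptions, not the actual modification to threaded_assembly.jl):

```julia
using LinuxPerf

# Stand-in workload; in the actual experiment this would be the threaded
# assembly loop from threaded_assembly.jl.
work(n) = sum(abs2, rand(n))

# Collect cycle, instruction, and cache counters for the wrapped expression.
stats = @pstats "cpu-cycles,instructions,cache-references,cache-misses" work(10^7)
```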