Ferrite-FEM / Ferrite.jl

Finite element toolbox for Julia
https://ferrite-fem.github.io

Threaded Assembly Performance Degradation #526

Closed: termi-official closed this issue 8 months ago

termi-official commented 2 years ago

Currently, threaded assembly does not scale to more than 3 cores on any machine I have tried, and I cannot figure out why. For the measurements I have modified threaded_assembly.jl to also utilize LinuxPerf.jl.
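
The measurement harness boils down to the snippet below; the calls are the same ones used in the full listing at the end of this post (with a warm-up run first so compilation is excluded from the counters):

```julia
using LinuxPerf

doassemble(K, colors, grid, dh, C)                   # warm-up run (compilation)
stats = @pstats doassemble(K, colors, grid, dh, C)   # collect hardware counters
LinuxPerf.printsummary(stats, expandthreads = true)  # per-thread summary as shown below
```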

Here are some measurements on a machine with 16 (32) threads:

```
~/Tools/julia-1.8.2/bin/julia --threads 2 --project src/literate/threaded_assembly.jl
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294685)
┌ cpu-cycles               3.83e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  2.42e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   2.79e+09  100.0%  # 72.7% of cycles
┌ instructions             7.18e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      4.41e+08  100.0%  #  6.1% of insns
└ branch-misses            2.61e+06  100.0%  #  0.6% of branch insns
┌ task-clock               1.33e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              9.96e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294695)
┌ cpu-cycles               3.79e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  2.30e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   2.74e+09  100.0%  # 72.4% of cycles
┌ instructions             7.16e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      4.39e+08  100.0%  #  6.1% of insns
└ branch-misses            2.54e+06  100.0%  #  0.6% of branch insns
┌ task-clock               1.31e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.84e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.62e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  4.73e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   5.53e+09  100.0%  # 72.5% of cycles
┌ instructions             1.43e+10  100.0%  #  1.9 insns per cycle
│ branch-instructions      8.80e+08  100.0%  #  6.1% of insns
└ branch-misses            5.15e+06  100.0%  #  0.6% of branch insns
┌ task-clock               2.64e+09  100.0%  #  2.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.18e+03  100.0%
                  aggregated from 2 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
1.353969 seconds (28.03 k allocations: 3.264 MiB)
~/Tools/julia-1.8.2/bin/julia --threads 4 --project src/literate/threaded_assembly.jl
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294544)
┌ cpu-cycles               2.00e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.21e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.46e+09  100.0%  # 73.3% of cycles
┌ instructions             3.64e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.27e+08  100.0%  #  6.2% of insns
└ branch-misses            1.33e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.94e+08  100.0%  # 694.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.19e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294546)
┌ cpu-cycles               1.92e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.09e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.41e+09  100.0%  # 73.2% of cycles
┌ instructions             3.58e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            1.24e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.65e+08  100.0%  # 665.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 294547)
┌ cpu-cycles               1.97e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.62e+07  100.0%  #  0.8% of cycles
└ stalled-cycles-backend   1.43e+09  100.0%  # 73.0% of cycles
┌ instructions             3.58e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            1.25e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.83e+08  100.0%  # 682.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 294548)
┌ cpu-cycles               1.98e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.22e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.47e+09  100.0%  # 73.9% of cycles
┌ instructions             3.58e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.19e+08  100.0%  #  6.1% of insns
└ branch-misses            1.24e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.89e+08  100.0%  # 688.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.87e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  5.13e+07  100.0%  #  0.7% of cycles
└ stalled-cycles-backend   5.77e+09  100.0%  # 73.3% of cycles
┌ instructions             1.44e+10  100.0%  #  1.8 insns per cycle
│ branch-instructions      8.86e+08  100.0%  #  6.2% of insns
└ branch-misses            5.07e+06  100.0%  #  0.6% of branch insns
┌ task-clock               2.73e+09  100.0%  #  2.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.20e+03  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.789631 seconds (55.97 k allocations: 4.494 MiB)
~/Tools/julia-1.8.2/bin/julia --threads 8 --project src/literate/threaded_assembly.jl
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294295)
┌ cpu-cycles               1.46e+09  100.0%  #  2.7 cycles per ns
│ stalled-cycles-frontend  5.46e+06  100.0%  #  0.4% of cycles
└ stalled-cycles-backend   1.16e+09  100.0%  # 79.5% of cycles
┌ instructions             1.91e+09  100.0%  #  1.3 insns per cycle
│ branch-instructions      1.28e+08  100.0%  #  6.7% of insns
└ branch-misses            8.05e+05  100.0%  #  0.6% of branch insns
┌ task-clock               5.31e+08  100.0%  # 530.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              3.30e+04  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294297)
┌ cpu-cycles               1.20e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.26e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.25e+08  100.0%  # 77.3% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.62e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.19e+08  100.0%  # 419.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.73e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 294298)
┌ cpu-cycles               1.21e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.30e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.35e+08  100.0%  # 77.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.60e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.19e+08  100.0%  # 418.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.72e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 294299)
┌ cpu-cycles               1.24e+09  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  6.68e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.61e+08  100.0%  # 77.4% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.76e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.35e+08  100.0%  # 435.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.95e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #5 (TID = 294300)
┌ cpu-cycles               1.19e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  5.29e+06  100.0%  #  0.4% of cycles
└ stalled-cycles-backend   9.25e+08  100.0%  # 77.7% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.59e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.17e+08  100.0%  # 417.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.59e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #6 (TID = 294301)
┌ cpu-cycles               1.27e+09  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  5.86e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.99e+08  100.0%  # 79.0% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.52e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.44e+08  100.0%  # 444.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.32e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #7 (TID = 294302)
┌ cpu-cycles               1.28e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.34e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   1.01e+09  100.0%  # 79.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.57e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.43e+08  100.0%  # 443.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.65e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #8 (TID = 294303)
┌ cpu-cycles               1.31e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.40e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   1.04e+09  100.0%  # 79.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.63e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.56e+08  100.0%  # 455.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.43e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               1.01e+10  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  4.86e+07  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   7.95e+09  100.0%  # 78.4% of cycles
┌ instructions             1.44e+10  100.0%  #  1.4 insns per cycle
│ branch-instructions      8.96e+08  100.0%  #  6.2% of insns
└ branch-misses            5.43e+06  100.0%  #  0.6% of branch insns
┌ task-clock               3.56e+09  100.0%  #  3.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              3.62e+04  100.0%
                  aggregated from 8 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.609788 seconds (111.85 k allocations: 6.954 MiB)
```

Eliminating the calls to assemble!, reinit! and shape_* does not increase scalability. Also, increasing the workload by replacing the linear problem with the assembly from the hyperelasticity (Neo-Hookean) example does not significantly increase scalability.
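
For reference, the stripped-down test replaces the element kernel with something along the lines of the following sketch (a hypothetical helper, not part of the example; it only writes into the thread-local scratch buffers):

```julia
# Hypothetical stripped-down element kernel used only to probe scaling:
# no reinit!, no shape_* evaluations and no assemble!, just arithmetic on
# the thread-local buffers from the ScratchValues struct in the listing below.
function assemble_cell_stripped!(scratch::ScratchValues, cell::Int)
    Ke, fe = scratch.Ke, scratch.fe
    fill!(Ke, 0)
    fill!(fe, 0)
    n = size(Ke, 1)
    for q_point in 1:8  # stand-in for the quadrature loop
        for i in 1:n
            fe[i] += 1.0
            for j in 1:n
                Ke[i, j] += 1.0
            end
        end
    end
    return nothing
end
```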

Happy for any suggestions on what the possible points of failure could be.

Source Code

```julia
using Ferrite, SparseArrays
using LinuxPerf

function create_example_2d_grid()
    grid = generate_grid(Quadrilateral, (10, 10), Vec{2}((0.0, 0.0)), Vec{2}((10.0, 10.0)))
    colors_workstream = create_coloring(grid; alg=ColoringAlgorithm.WorkStream)
    colors_greedy = create_coloring(grid; alg=ColoringAlgorithm.Greedy)
    vtk_grid("colored", grid) do vtk
        vtk_cell_data_colors(vtk, colors_workstream, "workstream-coloring")
        vtk_cell_data_colors(vtk, colors_greedy, "greedy-coloring")
    end
end
create_example_2d_grid();

# ![](coloring.png)
#
# *Figure 1*: Element coloring using the "workstream"-algorithm (left) and the "greedy"-
# algorithm (right).

# ## Cantilever beam in 3D with threaded assembly
# We will now look at an example where we assemble the stiffness matrix using multiple
# threads. We set up a simple grid and create a coloring, then create a DofHandler,
# and define the material stiffness

# #### Grid for the beam
function create_colored_cantilever_grid(celltype, n)
    grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
    colors = create_coloring(grid)
    return grid, colors
end;

# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
    dh = DofHandler(grid)
    push!(dh, :u, dim) # Add a displacement field
    close!(dh)
end;

# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
    E = 200e9
    ν = 0.3
    λ = E*ν / ((1+ν) * (1 - 2ν))
    μ = E / (2(1+ν))
    δ(i,j) = i == j ? 1.0 : 0.0
    g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
    C = SymmetricTensor{4, dim}(g);
    return C
end;

# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
    Ke::Matrix{T}
    fe::Vector{T}
    cellvalues::CV
    facevalues::FV
    global_dofs::Vector{Int}
    ɛ::Vector{TT}
    coordinates::Vector{Vec{dim, T}}
    assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;

# Each thread need its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
    ## Interpolations and values
    interpolation_space = Lagrange{dim, refshape, 1}()
    quadrature_rule = QuadratureRule{dim, refshape}(order)
    face_quadrature_rule = QuadratureRule{dim-1, refshape}(order)
    cellvalues = [CellVectorValues(quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    facevalues = [FaceVectorValues(face_quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    return cellvalues, facevalues
end;

# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
    nthreads = Threads.nthreads()
    assemblers = [start_assemble(K, f) for i in 1:nthreads]
    cellvalues, facevalues = create_values(RefCube, dim, 2)

    n_basefuncs = getnbasefunctions(cellvalues[1])
    global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]

    fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
    Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]

    ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]

    coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]

    return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
                          ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;

# ## Threaded assemble
# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}) where {dim}
    f = zeros(ndofs(dh))
    scratches = create_scratchvalues(K, f, dh)
    b = Vec{3}((0.0, 0.0, 0.0)) # Body force

    for color in colors
        ## Each color is safe to assemble threaded
        Threads.@threads for i in 1:length(color)
            assemble_cell!(scratches[Threads.threadid()], color[i], K, grid, dh, C, b)
        end
    end

    return K, f
end

# The cell assembly function is written the same way as if it was a single threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
                        grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}

    ## Unpack our stuff from the scratch
    Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
        scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues, scratch.global_dofs,
        scratch.ɛ, scratch.coordinates, scratch.assembler

    fill!(Ke, 0)
    fill!(fe, 0)

    n_basefuncs = getnbasefunctions(cellvalues)

    ## Fill up the coordinates
    nodeids = grid.cells[cell].nodes
    for j in 1:length(coordinates)
        coordinates[j] = grid.nodes[nodeids[j]].x
    end

    reinit!(cellvalues, coordinates)

    for q_point in 1:getnquadpoints(cellvalues)
        for i in 1:n_basefuncs
            ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
        end
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            fe[i] += (δu ⋅ b) * dΩ
            ɛC = ɛ[i] ⊡ C
            for j in 1:n_basefuncs
                Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
            end
        end
    end

    celldofs!(global_dofs, dh, cell)
    assemble!(assembler, global_dofs, fe, Ke)
end;

function run_assemble()
    refshape = RefCube
    quadrature_order = 2
    dim = 3
    n = 20
    grid, colors = create_colored_cantilever_grid(Hexahedron, n);
    dh = create_dofhandler(grid);

    K = create_sparsity_pattern(dh);
    C = create_stiffness(Val{3}());
    ## compilation
    doassemble(K, colors, grid, dh, C);
    stats = @pstats doassemble(K, colors, grid, dh, C);
    LinuxPerf.printsummary(stats, expandthreads = true)
    b = @elapsed @time K, f = doassemble(K, colors, grid, dh, C);
    return b
end
```

TODOs

termi-official commented 8 months ago

> @termi-official and @KnutAM: Is there enough novelty in your investigations of parallel matrix assembly for a paper? If so, would you be interested in writing something? Let me know your thoughts please at pkrysl@ucsd.edu. I look forward to it. P

The short answer is: No, there is not even incremental research happening right now.

We are merely trying to reproduce a subset of the results from the WorkStream paper and investigate bottlenecks in our implementation (since, when I opened this thread, our implementation underperformed for reasons I could not fully explain). There is nothing novel happening here that is not already described in the literature on multithreaded assembly. And I think that with the recent work from the CEED project the most important scalability problems for multithreaded assembly are solved anyway (which I am currently trying to reproduce). But thanks for the offer @PetrKryslUCSD, I appreciate it!

PetrKryslUCSD commented 8 months ago

Understood. Thanks.

Could you by any chance point to the work from CEED? Thanks.

termi-official commented 8 months ago

I think a good start is https://doi.org/10.1016/j.parco.2021.102841. A more exhaustive list can be found at https://ceed.exascaleproject.org/pubs/.

PetrKryslUCSD commented 8 months ago

An additional data point: FinEtools, assembly only, on a 64-core Opteron machine with 1, 2, 4, 8, 16, 64 threads:

```
julia> 64.96466  ./ [35.713732, 15.687828, 9.211306, 4.647433, 2.38525, 1.358766]
6-element Vector{Float64}:
 1.81904e+00
 4.14109e+00
 7.05271e+00
 1.39786e+01
 2.72360e+01
 4.78115e+01
```

PetrKryslUCSD commented 8 months ago

> And I think that with the recent work from the CEED project the most important scalability problems for multithreaded assembly are solved anyway (which I am currently trying to reproduce).

I must be missing something. The paper you linked does not talk about threading (unless you count GPU computing as that). Did you have in mind a different paper?

PetrKryslUCSD commented 7 months ago

@termi-official Ping...

termi-official commented 7 months ago

> And I think that with the recent work from the CEED project the most important scalability problems for multithreaded assembly are solved anyway (which I am currently trying to reproduce).

> I must be missing something. The paper you linked does not talk about threading (unless you count GPU computing as that). Did you have in mind a different paper?

GPU parallelism is basically thread parallelism. The paper gives an overview with quite a few references where you can dive deeper. Also see, e.g., Figs. 7 and 8 for some benchmarks where throughput is measured, which can serve as a proxy for scalability.

PetrKryslUCSD commented 7 months ago

I think their solution is really not to build a matrix at all. So, good, but not a silver bullet...
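
To make that concrete, here is a minimal sketch of the matrix-free idea (hypothetical names, not Ferrite's or libCEED's API): instead of assembling a global sparse K, the operator action y = A*x is recomputed element by element, with the coloring keeping the scatter race-free across threads.

```julia
# Sketch only: element_dofs[e] holds the global dof indices of element e and
# element_matrices[e] its precomputed local matrix (in a real matrix-free kernel
# this product would be replaced by a quadrature-level evaluation).
function matrix_free_apply!(y, x, colors, element_dofs, element_matrices)
    fill!(y, 0.0)
    for color in colors
        Threads.@threads for e in color
            dofs = element_dofs[e]               # gather global dof indices
            ye = element_matrices[e] * x[dofs]   # local action instead of global assembly
            y[dofs] .+= ye                       # scatter; safe within one color
        end
    end
    return y
end
```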
