Ferrite-FEM / Ferrite.jl

Finite element toolbox for Julia
https://ferrite-fem.github.io

Threaded Assembly Performance Degradation #526

Closed termi-official closed 8 months ago

termi-official commented 2 years ago

Currently, threaded assembly does not scale beyond roughly 3 cores on any machine I have tried, and I cannot figure out why. For the measurements I have modified threaded_assembly.jl to also utilize LinuxPerf.jl.
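The measurement wrapper is essentially the following (a minimal sketch; `doassemble` and its arguments are the ones defined in the full script at the end of this post):

```julia
using LinuxPerf

# Warm-up run to exclude compilation, then profile the assembly with hardware counters.
doassemble(K, colors, grid, dh, C)
stats = @pstats doassemble(K, colors, grid, dh, C)
LinuxPerf.printsummary(stats, expandthreads = true)  # produces the per-thread tables below
@time doassemble(K, colors, grid, dh, C)
```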

Here are some measurements on a machine with 16 cores (32 hardware threads):

~/Tools/julia-1.8.2/bin/julia --threads 2 --project src/literate/threaded_assembly.jl                               ✔  22s  
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294685)
┌ cpu-cycles               3.83e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  2.42e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   2.79e+09  100.0%  # 72.7% of cycles
┌ instructions             7.18e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      4.41e+08  100.0%  #  6.1% of insns
└ branch-misses            2.61e+06  100.0%  #  0.6% of branch insns
┌ task-clock               1.33e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              9.96e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294695)
┌ cpu-cycles               3.79e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  2.30e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   2.74e+09  100.0%  # 72.4% of cycles
┌ instructions             7.16e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      4.39e+08  100.0%  #  6.1% of insns
└ branch-misses            2.54e+06  100.0%  #  0.6% of branch insns
┌ task-clock               1.31e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.84e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.62e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  4.73e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   5.53e+09  100.0%  # 72.5% of cycles
┌ instructions             1.43e+10  100.0%  #  1.9 insns per cycle
│ branch-instructions      8.80e+08  100.0%  #  6.1% of insns
└ branch-misses            5.15e+06  100.0%  #  0.6% of branch insns
┌ task-clock               2.64e+09  100.0%  #  2.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.18e+03  100.0%
                  aggregated from 2 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
1.353969 seconds (28.03 k allocations: 3.264 MiB)
 ~/Tools/julia-1.8.2/bin/julia --threads 4 --project src/literate/threaded_assembly.jl                               ✔  24s  
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294544)
┌ cpu-cycles               2.00e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.21e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.46e+09  100.0%  # 73.3% of cycles
┌ instructions             3.64e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.27e+08  100.0%  #  6.2% of insns
└ branch-misses            1.33e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.94e+08  100.0%  # 694.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.19e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294546)
┌ cpu-cycles               1.92e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.09e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.41e+09  100.0%  # 73.2% of cycles
┌ instructions             3.58e+09  100.0%  #  1.9 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            1.24e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.65e+08  100.0%  # 665.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 294547)
┌ cpu-cycles               1.97e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.62e+07  100.0%  #  0.8% of cycles
└ stalled-cycles-backend   1.43e+09  100.0%  # 73.0% of cycles
┌ instructions             3.58e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            1.25e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.83e+08  100.0%  # 682.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 294548)
┌ cpu-cycles               1.98e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.22e+07  100.0%  #  0.6% of cycles
└ stalled-cycles-backend   1.47e+09  100.0%  # 73.9% of cycles
┌ instructions             3.58e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      2.19e+08  100.0%  #  6.1% of insns
└ branch-misses            1.24e+06  100.0%  #  0.6% of branch insns
┌ task-clock               6.89e+08  100.0%  # 688.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.87e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  5.13e+07  100.0%  #  0.7% of cycles
└ stalled-cycles-backend   5.77e+09  100.0%  # 73.3% of cycles
┌ instructions             1.44e+10  100.0%  #  1.8 insns per cycle
│ branch-instructions      8.86e+08  100.0%  #  6.2% of insns
└ branch-misses            5.07e+06  100.0%  #  0.6% of branch insns
┌ task-clock               2.73e+09  100.0%  #  2.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.20e+03  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.789631 seconds (55.97 k allocations: 4.494 MiB)
  ~/Tools/julia-1.8.2/bin/julia --threads 8 --project src/literate/threaded_assembly.jl                                          ✔ 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 294295)
┌ cpu-cycles               1.46e+09  100.0%  #  2.7 cycles per ns
│ stalled-cycles-frontend  5.46e+06  100.0%  #  0.4% of cycles
└ stalled-cycles-backend   1.16e+09  100.0%  # 79.5% of cycles
┌ instructions             1.91e+09  100.0%  #  1.3 insns per cycle
│ branch-instructions      1.28e+08  100.0%  #  6.7% of insns
└ branch-misses            8.05e+05  100.0%  #  0.6% of branch insns
┌ task-clock               5.31e+08  100.0%  # 530.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              3.30e+04  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 294297)
┌ cpu-cycles               1.20e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.26e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.25e+08  100.0%  # 77.3% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.62e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.19e+08  100.0%  # 419.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.73e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 294298)
┌ cpu-cycles               1.21e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.30e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.35e+08  100.0%  # 77.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.60e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.19e+08  100.0%  # 418.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.72e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 294299)
┌ cpu-cycles               1.24e+09  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  6.68e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.61e+08  100.0%  # 77.4% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.76e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.35e+08  100.0%  # 435.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.95e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #5 (TID = 294300)
┌ cpu-cycles               1.19e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  5.29e+06  100.0%  #  0.4% of cycles
└ stalled-cycles-backend   9.25e+08  100.0%  # 77.7% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.59e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.17e+08  100.0%  # 417.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.59e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #6 (TID = 294301)
┌ cpu-cycles               1.27e+09  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  5.86e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   9.99e+08  100.0%  # 79.0% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.52e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.44e+08  100.0%  # 444.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.32e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #7 (TID = 294302)
┌ cpu-cycles               1.28e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.34e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   1.01e+09  100.0%  # 79.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.57e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.43e+08  100.0%  # 443.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.65e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #8 (TID = 294303)
┌ cpu-cycles               1.31e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  6.40e+06  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   1.04e+09  100.0%  # 79.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.4 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.63e+05  100.0%  #  0.6% of branch insns
┌ task-clock               4.56e+08  100.0%  # 455.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.43e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               1.01e+10  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  4.86e+07  100.0%  #  0.5% of cycles
└ stalled-cycles-backend   7.95e+09  100.0%  # 78.4% of cycles
┌ instructions             1.44e+10  100.0%  #  1.4 insns per cycle
│ branch-instructions      8.96e+08  100.0%  #  6.2% of insns
└ branch-misses            5.43e+06  100.0%  #  0.6% of branch insns
┌ task-clock               3.56e+09  100.0%  #  3.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              3.62e+04  100.0%
                  aggregated from 8 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.609788 seconds (111.85 k allocations: 6.954 MiB)

Eliminating the calls to assemble!, reinit! and shape_* does not improve scalability. Increasing the workload by replacing the linear problem with the assembly from the hyperelastic (i.e. Neo-Hookean) example does not significantly improve scalability either.

Happy for any suggestions on what the possible points of failure could be.

Source Code

```julia
using Ferrite, SparseArrays
using LinuxPerf

function create_example_2d_grid()
    grid = generate_grid(Quadrilateral, (10, 10), Vec{2}((0.0, 0.0)), Vec{2}((10.0, 10.0)))
    colors_workstream = create_coloring(grid; alg=ColoringAlgorithm.WorkStream)
    colors_greedy = create_coloring(grid; alg=ColoringAlgorithm.Greedy)
    vtk_grid("colored", grid) do vtk
        vtk_cell_data_colors(vtk, colors_workstream, "workstream-coloring")
        vtk_cell_data_colors(vtk, colors_greedy, "greedy-coloring")
    end
end
create_example_2d_grid();

# ![](coloring.png)
#
# *Figure 1*: Element coloring using the "workstream"-algorithm (left) and the "greedy"-
# algorithm (right).

# ## Cantilever beam in 3D with threaded assembly
# We will now look at an example where we assemble the stiffness matrix using multiple
# threads. We set up a simple grid and create a coloring, then create a DofHandler,
# and define the material stiffness

# #### Grid for the beam
function create_colored_cantilever_grid(celltype, n)
    grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
    colors = create_coloring(grid)
    return grid, colors
end;

# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
    dh = DofHandler(grid)
    push!(dh, :u, dim) # Add a displacement field
    close!(dh)
end;

# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
    E = 200e9
    ν = 0.3
    λ = E*ν / ((1+ν) * (1 - 2ν))
    μ = E / (2(1+ν))
    δ(i,j) = i == j ? 1.0 : 0.0
    g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
    C = SymmetricTensor{4, dim}(g);
    return C
end;

# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
    Ke::Matrix{T}
    fe::Vector{T}
    cellvalues::CV
    facevalues::FV
    global_dofs::Vector{Int}
    ɛ::Vector{TT}
    coordinates::Vector{Vec{dim, T}}
    assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;

# Each thread need its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
    ## Interpolations and values
    interpolation_space = Lagrange{dim, refshape, 1}()
    quadrature_rule = QuadratureRule{dim, refshape}(order)
    face_quadrature_rule = QuadratureRule{dim-1, refshape}(order)
    cellvalues = [CellVectorValues(quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    facevalues = [FaceVectorValues(face_quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    return cellvalues, facevalues
end;

# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
    nthreads = Threads.nthreads()
    assemblers = [start_assemble(K, f) for i in 1:nthreads]
    cellvalues, facevalues = create_values(RefCube, dim, 2)

    n_basefuncs = getnbasefunctions(cellvalues[1])
    global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]

    fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
    Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]

    ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]

    coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]

    return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
                         ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;

# ## Threaded assemble
# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}) where {dim}
    f = zeros(ndofs(dh))
    scratches = create_scratchvalues(K, f, dh)
    b = Vec{3}((0.0, 0.0, 0.0)) # Body force

    for color in colors
        ## Each color is safe to assemble threaded
        Threads.@threads for i in 1:length(color)
            assemble_cell!(scratches[Threads.threadid()], color[i], K, grid, dh, C, b)
        end
    end

    return K, f
end

# The cell assembly function is written the same way as if it was a single threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
                        grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}

    ## Unpack our stuff from the scratch
    Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
         scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues,
         scratch.global_dofs, scratch.ɛ, scratch.coordinates, scratch.assembler

    fill!(Ke, 0)
    fill!(fe, 0)

    n_basefuncs = getnbasefunctions(cellvalues)

    ## Fill up the coordinates
    nodeids = grid.cells[cell].nodes
    for j in 1:length(coordinates)
        coordinates[j] = grid.nodes[nodeids[j]].x
    end

    reinit!(cellvalues, coordinates)

    for q_point in 1:getnquadpoints(cellvalues)
        for i in 1:n_basefuncs
            ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
        end
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            fe[i] += (δu ⋅ b) * dΩ
            ɛC = ɛ[i] ⊡ C
            for j in 1:n_basefuncs
                Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
            end
        end
    end

    celldofs!(global_dofs, dh, cell)
    assemble!(assembler, global_dofs, fe, Ke)
end;

function run_assemble()
    refshape = RefCube
    quadrature_order = 2
    dim = 3
    n = 20
    grid, colors = create_colored_cantilever_grid(Hexahedron, n);
    dh = create_dofhandler(grid);

    K = create_sparsity_pattern(dh);
    C = create_stiffness(Val{3}());
    ## compilation
    doassemble(K, colors, grid, dh, C);
    stats = @pstats doassemble(K, colors, grid, dh, C);
    LinuxPerf.printsummary(stats, expandthreads = true)
    b = @elapsed @time K, f = doassemble(K, colors, grid, dh, C);
    return b
end
```

TODOs

koehlerson commented 2 years ago

Your example is missing the call to run_assemble() at the bottom, and I would be very interested in this as well.

I get the following:

julia --threads 2 --project src/literate/threaded_assembly.jl    
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 135838)
╶ cpu-cycles               2.56e+09  100.0%  #  4.2 cycles per ns
┌ instructions             7.19e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      4.37e+08  100.0%  #  6.1% of insns
└ branch-misses            1.20e+06  100.0%  #  0.3% of branch insns
┌ task-clock               6.13e+08  100.0%  # 612.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.03e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 135840)
╶ cpu-cycles               3.99e+09    7.1%  #  4.1 cycles per ns
┌ instructions             1.14e+10    7.1%  #  2.9 insns per cycle
│ branch-instructions      6.95e+08    7.1%  #  6.1% of insns
└ branch-misses            2.06e+06    7.1%  #  0.3% of branch insns
┌ task-clock               9.68e+08  100.0%  # 967.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
╶ cpu-cycles               6.58e+09   43.1%  #  4.2 cycles per ns
┌ instructions             1.85e+10   43.1%  #  2.8 insns per cycle
│ branch-instructions      1.13e+09   43.1%  #  6.1% of insns
└ branch-misses            3.12e+06   43.1%  #  0.3% of branch insns
┌ task-clock               1.58e+09  100.0%  #  1.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.03e+02  100.0%
                  aggregated from 2 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  1.036346 seconds (28.03 k allocations: 3.264 MiB)
julia --threads 4 --project src/literate/threaded_assembly.jl    
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 135863)
╶ cpu-cycles               1.45e+09  100.0%  #  3.9 cycles per ns
┌ instructions             3.61e+09  100.0%  #  2.5 insns per cycle
│ branch-instructions      2.21e+08  100.0%  #  6.1% of insns
└ branch-misses            6.24e+05  100.0%  #  0.3% of branch insns
┌ task-clock               3.69e+08  100.0%  # 369.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.64e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 135865)
╶ cpu-cycles               1.24e+09  100.0%  #  3.9 cycles per ns
┌ instructions             3.60e+09  100.0%  #  2.9 insns per cycle
│ branch-instructions      2.20e+08  100.0%  #  6.1% of insns
└ branch-misses            5.79e+05  100.0%  #  0.3% of branch insns
┌ task-clock               3.14e+08  100.0%  # 314.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 135866)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               5.13e+08  100.0%  # 513.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 135867)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               5.15e+08  100.0%  # 515.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
╶ cpu-cycles               6.74e+09   39.9%  #  3.9 cycles per ns
┌ instructions             1.81e+10   39.9%  #  2.7 insns per cycle
│ branch-instructions      1.10e+09   39.9%  #  6.1% of insns
└ branch-misses            3.02e+06   39.9%  #  0.3% of branch insns
┌ task-clock               1.71e+09  100.0%  #  1.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.65e+02  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  0.567853 seconds (55.97 k allocations: 4.494 MiB)
julia --threads 8 --project src/literate/threaded_assembly.jl    
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 135887)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               4.01e+08  100.0%  # 401.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.39e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 135889)
╶ cpu-cycles               6.37e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            3.27e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.83e+08  100.0%  # 183.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 135890)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               2.78e+08  100.0%  # 277.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 135891)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               2.75e+08  100.0%  # 275.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #5 (TID = 135892)
╶ cpu-cycles               6.37e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            3.31e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.83e+08  100.0%  # 183.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #6 (TID = 135893)
╶ cpu-cycles                     NA    0.0%
┌ instructions                   NA    0.0%
│ branch-instructions            NA    0.0%
└ branch-misses                  NA    0.0%
┌ task-clock               2.78e+08  100.0%  # 277.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #7 (TID = 135894)
╶ cpu-cycles               6.36e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.8 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            3.24e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.83e+08  100.0%  # 183.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #8 (TID = 135895)
╶ cpu-cycles               6.14e+08  100.0%  #  3.5 cycles per ns
┌ instructions             1.80e+09  100.0%  #  2.9 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            3.22e+05  100.0%  #  0.3% of branch insns
┌ task-clock               1.77e+08  100.0%  # 177.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
╶ cpu-cycles               6.80e+09   37.2%  #  3.5 cycles per ns
┌ instructions             1.94e+10   37.2%  #  2.8 insns per cycle
│ branch-instructions      1.18e+09   37.2%  #  6.1% of insns
└ branch-misses            3.51e+06   37.2%  #  0.3% of branch insns
┌ task-clock               1.96e+09  100.0%  #  2.0 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.41e+02  100.0%
                  aggregated from 8 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  0.418080 seconds (111.85 k allocations: 6.954 MiB)
KnutAM commented 2 years ago

Maybe completely wrong, but create_scratchvalues(K, f, dh) takes about 6 ms for a single thread on my laptop, and since it isn't threaded the cost grows with the number of threads (about 0.05 s for 8 threads). Calling assemble_cell! takes about 12.3 µs for me, so multiplied by the number of cells that is ≈ 0.98 s, while the full run is 1.35 s. Perfect scaling for the loop would then be t(nthreads) = 0.37 + 0.006·nthreads + 0.98/nthreads, giving 1.356 s (1 thread), 0.872 s (2), 0.7147 s (3), 0.639 s (4), and 0.5405 s (8).
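A small script evaluating that model (the numbers are the estimates above, purely illustrative):

```julia
# Perfect-scaling model: ~0.37 s serial overhead, ~6 ms of (unthreaded) scratch
# creation per thread, and ~0.98 s of cell work that parallelizes ideally.
t_model(nthreads) = 0.37 + 0.006 * nthreads + 0.98 / nthreads

for n in (1, 2, 3, 4, 8)
    println(n, " threads: ", t_model(n), " s")
end
# Reproduces the predicted 1.356, 0.872, 0.7147, 0.639 and 0.5405 seconds listed above.
```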

koehlerson commented 2 years ago

But only the pure assembly is timed. Everything else, like create_scratchvalues, shouldn't be taken into consideration, no?

Ah well, never mind, the creation is within the function.

koehlerson commented 2 years ago

If I move the buffer creation out of the function and instead pass the buffers in, I get the following (a sketch of the change follows after the timings):

max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 1 --project src/literate/threaded_assembly.jl                       10s 18:33:52
  1.191017 seconds (108 allocations: 17.719 KiB)
max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 2 --project src/literate/threaded_assembly.jl                     1m 8s 18:33:06
  0.624988 seconds (250 allocations: 35.938 KiB)
max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 4 --project src/literate/threaded_assembly.jl                       11s 18:33:21
  0.526636 seconds (471 allocations: 70.688 KiB)
max@archlinux ~/repos/Ferrite.jl/docs mk/tetcube_inconsitency* ❯ julia --threads 8 --project src/literate/threaded_assembly.jl                       11s 18:33:36
  0.300301 seconds (923 allocations: 140.500 KiB)
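For reference, a minimal sketch of that change (essentially the signature termi-official posts later in this thread, without the chunking): the force vector, scratches, and body force are created once in `run_assemble` and passed in, so only the color loop is timed.

```julia
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler,
                    C::SymmetricTensor{4, dim}, f::Vector{Float64},
                    scratches::Vector{SV}, b::Vec{dim}) where {dim, SV}
    for color in colors
        ## Each color is safe to assemble threaded
        Threads.@threads for i in 1:length(color)
            assemble_cell!(scratches[Threads.threadid()], color[i], K, grid, dh, C, b)
        end
    end
    return K, f
end
```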
KristofferC commented 2 years ago

Mr Amdahl says hello.

koehlerson commented 2 years ago

> Mr Amdahl says hello.

[attached image]

termi-official commented 2 years ago

I would argue this is not Amdahl's law. In this benchmark we have ~4000 entries per color and ~16 colors, so with 8 threads we should still not see this level of performance loss. The work per element is approximately constant and independent. Strong scaling is also not great. Interestingly, I also have some benchmarks where I fully decouple the assembly, emulating a partitioning of the problem as in distributed computations, and I observe scaling issues beyond 4 threads there as well.
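As a rough sanity check of the per-thread work under these settings (cell count for n = 20, per-cell time from Knut's estimate above; the color count is the approximate figure quoted here):

```julia
ncells = 200 * 20 * 20              # 10n × n × n hexahedra for n = 20 → 80_000 cells
ncolors = 16                        # approximate number of colors quoted above
cells_per_color = ncells ÷ ncolors  # crude average ≈ 5_000 (the text above quotes ~4_000)
cells_per_task = cells_per_color ÷ 8  # ≈ 625 cells per thread and color with 8 threads
t_cell = 12.3e-6                    # per-cell assembly time estimated by Knut above
t_task = cells_per_task * t_cell    # ≈ 7.7 ms of work per task, far above `@threads` overhead
```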

With n = 10

 ~/Tools/julia-1.8.2/bin/julia --threads 1 --project src/literate/threaded_assembly.jl                               ✔  21s  
DoFs: 36663
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 328117)
┌ cpu-cycles               8.70e+08  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  1.40e+06  100.0%  #  0.2% of cycles
└ stalled-cycles-backend   6.64e+08  100.0%  # 76.3% of cycles
┌ instructions             1.79e+09  100.0%  #  2.1 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.22e+05  100.0%  #  0.6% of branch insns
┌ task-clock               3.13e+08  100.0%  # 312.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.47e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
┌ cpu-cycles               8.70e+08  100.0%  #  2.8 cycles per ns
│ stalled-cycles-frontend  1.40e+06  100.0%  #  0.2% of cycles
└ stalled-cycles-backend   6.64e+08  100.0%  # 76.3% of cycles
┌ instructions             1.79e+09  100.0%  #  2.1 insns per cycle
│ branch-instructions      1.09e+08  100.0%  #  6.1% of insns
└ branch-misses            6.22e+05  100.0%  #  0.6% of branch insns
┌ task-clock               3.13e+08  100.0%  # 312.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.47e+02  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
0.300560 seconds (14.03 k allocations: 919.344 KiB)

With n = 20

 ~/Tools/julia-1.8.2/bin/julia --threads 8 --project src/literate/threaded_assembly.jl                               ✔  22s  
DoFs: 265923
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 327996)
┌ cpu-cycles               1.50e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.02e+08  100.0%  #  6.8% of cycles
└ stalled-cycles-backend   1.09e+09  100.0%  # 72.3% of cycles
┌ instructions             1.92e+09  100.0%  #  1.3 insns per cycle
│ branch-instructions      1.31e+08  100.0%  #  6.8% of insns
└ branch-misses            6.81e+05  100.0%  #  0.5% of branch insns
┌ task-clock               5.23e+08  100.0%  # 523.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.32e+02  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 327998)
┌ cpu-cycles               1.03e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  9.28e+07  100.0%  #  9.0% of cycles
└ stalled-cycles-backend   6.69e+08  100.0%  # 64.6% of cycles
┌ instructions             1.78e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.06e+08  100.0%  #  6.0% of insns
└ branch-misses            6.45e+05  100.0%  #  0.6% of branch insns
┌ task-clock               3.60e+08  100.0%  # 359.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 327999)
┌ cpu-cycles               1.14e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.28e+08  100.0%  # 11.2% of cycles
└ stalled-cycles-backend   7.78e+08  100.0%  # 68.0% of cycles
┌ instructions             1.79e+09  100.0%  #  1.6 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.96e+05  100.0%  #  0.5% of branch insns
┌ task-clock               3.97e+08  100.0%  # 396.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 328000)
┌ cpu-cycles               1.17e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.30e+08  100.0%  # 11.1% of cycles
└ stalled-cycles-backend   8.04e+08  100.0%  # 68.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            6.00e+05  100.0%  #  0.5% of branch insns
┌ task-clock               4.05e+08  100.0%  # 405.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #5 (TID = 328001)
┌ cpu-cycles               1.08e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  9.13e+07  100.0%  #  8.5% of cycles
└ stalled-cycles-backend   7.60e+08  100.0%  # 70.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.98e+05  100.0%  #  0.5% of branch insns
┌ task-clock               3.74e+08  100.0%  # 374.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #6 (TID = 328002)
┌ cpu-cycles               1.17e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.29e+08  100.0%  # 11.1% of cycles
└ stalled-cycles-backend   8.00e+08  100.0%  # 68.5% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.96e+05  100.0%  #  0.5% of branch insns
┌ task-clock               4.09e+08  100.0%  # 409.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #7 (TID = 328003)
┌ cpu-cycles               1.20e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.40e+08  100.0%  # 11.7% of cycles
└ stalled-cycles-backend   8.27e+08  100.0%  # 69.1% of cycles
┌ instructions             1.79e+09  100.0%  #  1.5 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.91e+05  100.0%  #  0.5% of branch insns
┌ task-clock               4.15e+08  100.0%  # 414.8 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #8 (TID = 328004)
┌ cpu-cycles               1.11e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  1.38e+08  100.0%  # 12.5% of cycles
└ stalled-cycles-backend   7.39e+08  100.0%  # 66.9% of cycles
┌ instructions             1.79e+09  100.0%  #  1.6 insns per cycle
│ branch-instructions      1.10e+08  100.0%  #  6.1% of insns
└ branch-misses            5.95e+05  100.0%  #  0.5% of branch insns
┌ task-clock               3.83e+08  100.0%  # 382.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               9.40e+09  100.0%  #  2.9 cycles per ns
│ stalled-cycles-frontend  9.52e+08  100.0%  # 10.1% of cycles
└ stalled-cycles-backend   6.46e+09  100.0%  # 68.7% of cycles
┌ instructions             1.44e+10  100.0%  #  1.5 insns per cycle
│ branch-instructions      8.97e+08  100.0%  #  6.2% of insns
└ branch-misses            4.90e+06  100.0%  #  0.5% of branch insns
┌ task-clock               3.27e+09  100.0%  #  3.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              8.36e+02  100.0%
                  aggregated from 8 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  0.637576 seconds (111.86 k allocations: 6.954 MiB)

Edit: To elaborate on the 8-thread run: we are in 3D and I doubled the number of elements in each direction, so the load increases by ~2^3.

Edit 2: Interestingly, here the frontend starts to stall, but I do not think this is enough to explain the performance drop.

termi-official commented 1 year ago

I think I was finally able to track down what is going wrong. The arithmetic intensity of the kernels in the current design of Ferrite is low, because we precompute a good amount of data. This implies that we hit memory limitations faster than expected, which explains the wildly varying scaling in the benchmarks on different systems reported above. We can observe this issue surfacing when monitoring the CPU utilization (e.g. only 235% for 16 threads on my system).

To show that this is not Amdahl's law, I equipartitioned each color so that each thread gets exactly one task, and I pinned the tasks. Further, we can see in LIKWID that the FLOPS collapse.

using Ferrite, SparseArrays, LIKWID

using ThreadPinning
pinthreads(:cores)

function create_colored_cantilever_grid(celltype, n)
    grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
    colors = create_coloring(grid)
    return grid, colors
end;

# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
    dh = DofHandler(grid)
    push!(dh, :u, dim) # Add a displacement field
    close!(dh)
end;

# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
    E = 200e9
    ν = 0.3
    λ = E*ν / ((1+ν) * (1 - 2ν))
    μ = E / (2(1+ν))
    δ(i,j) = i == j ? 1.0 : 0.0
    g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
    C = SymmetricTensor{4, dim}(g);
    return C
end;

# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
    Ke::Matrix{T}
    fe::Vector{T}
    cellvalues::CV
    facevalues::FV
    global_dofs::Vector{Int}
    ɛ::Vector{TT}
    coordinates::Vector{Vec{dim, T}}
    assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;

# Each thread need its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
    ## Interpolations and values
    interpolation_space = Lagrange{dim, refshape, 1}()
    quadrature_rule = QuadratureRule{dim, refshape}(order)
    face_quadrature_rule = QuadratureRule{dim-1, refshape}(order)
    cellvalues = [CellVectorValues(quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    facevalues = [FaceVectorValues(face_quadrature_rule, interpolation_space) for i in 1:Threads.nthreads()];
    return cellvalues, facevalues
end;

# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
    nthreads = Threads.nthreads()
    assemblers = [start_assemble(K, f) for i in 1:nthreads]
    cellvalues, facevalues = create_values(RefCube, dim, 2)

    n_basefuncs = getnbasefunctions(cellvalues[1])
    global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]

    fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
    Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]

    ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]

    coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]

    return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
                         ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;

# ## Threaded assemble

# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, f::Vector{Float64}, scratches::Vector{SV}, b::Vec{dim}) where {dim, SV}
    ## Each color is safe to assemble threaded
    for color in colors
        ## We try to equipartition the array to increase load per task.
        chunk_size = max(1, 1 + length(color) ÷ Threads.nthreads())
        color_partition = [color[i:min(i + chunk_size - 1, end)] for i in 1:chunk_size:length(color)]
        ## Now we should have a 1:1 correspondence between tasks and elements to assemble.
        Threads.@threads :static for i in 1:length(color_partition)
            for cellid ∈ color_partition[i]
                assemble_cell!(scratches[i], cellid, K, grid, dh, C, b)
            end
        end
    end

    return K, f
end

# The cell assembly function is written the same way as if it was a single threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
                        grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}

    ## Unpack our stuff from the scratch
    Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
         scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues,
         scratch.global_dofs, scratch.ɛ, scratch.coordinates, scratch.assembler

    fill!(Ke, 0)
    fill!(fe, 0)

    n_basefuncs = getnbasefunctions(cellvalues)

    ## Fill up the coordinates
    nodeids = grid.cells[cell].nodes
    for j in 1:length(coordinates)
        coordinates[j] = grid.nodes[nodeids[j]].x
    end

    reinit!(cellvalues, coordinates)

    for q_point in 1:getnquadpoints(cellvalues)
        for i in 1:n_basefuncs
            ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
        end
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            fe[i] += (δu ⋅ b) * dΩ
            ɛC = ɛ[i] ⊡ C
            for j in 1:n_basefuncs
                Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
            end
        end
    end

    celldofs!(global_dofs, dh, cell)
    assemble!(assembler, global_dofs, fe, Ke)
end;

function run_assemble()
    refshape = RefCube
    quadrature_order = 5
    dim = 3
    n = 30
    grid, colors = create_colored_cantilever_grid(Hexahedron, n);
    dh = create_dofhandler(grid);

    K = create_sparsity_pattern(dh);
    C = create_stiffness(Val{3}());
    f = zeros(ndofs(dh))
    scratches = create_scratchvalues(K, f, dh)
    b = Vec{3}((0.0, 0.0, 0.0))
    ## compilation
    doassemble(K, colors, grid, dh, C, f, scratches, b);
    return @perfmon "FLOPS_DP" doassemble(K, colors, grid, dh, C, f, scratches, b);
end

metrics, events = run_assemble();
clocks = [res["Clock [MHz]"] for res in metrics["FLOPS_DP"]]
println("Clock [MHz] (min, avg, max): ", minimum(clocks), " | ", mean(clocks), " | " , maximum(clocks)) 

thread_times = [res["Runtime unhalted [s]"] for res in metrics["FLOPS_DP"]]
println("Runtime unhalted [s] (min, avg, max): ", minimum(thread_times), " | ", mean(thread_times), " | " , maximum(thread_times))

println("Total runtime [s] ", first([res["Runtime (RDTSC) [s]"] for res in metrics["FLOPS_DP"]]))
lijas commented 1 year ago

I don't understand why precomputing/allocating a bunch of values (e.g. shape functions) would be worse for multithreaded code... but do you think a sort of "lazy" cellvalues would be better for this case?

function element_routine!(cv, ...)
    for iqp in 1:getnquadpoints(cv)
        reinit!(cv, iqp) # compute shape values, jacobian and shape gradients for the current gauss point here
        ...
    end
end

The drawback is that we have to recompute N and dN/dξ for each element, but perhaps we could get better threading performance.

termi-official commented 1 year ago

What works best has to be investigated further. Also, I am not 100% confident that this is the full picture yet. Indeed, the common strategy to resolve such issues is to just recompute a bunch of values, or even go fully matrix-free and reevaluate everything every time. Please also note that for more expensive assembly this problem is not as pronounced.

The "problem" with precomputation (and one copy per thread) is just that you increase the number of accesses to the main memory while having a low number of float point operations in your assembly kernel. This way each assembly kernel puts some amount of pressure on your memory bus via the memory acces. Now, your memory bandwith is limited, while you put more workers on the problem, and with enough parallel workers this memory bandwith becomes the bottleneck, because you cannot move the variables from the memory to the processor cores fast enough to keep them busy. It is quite a simplification, but I hope that the point is clear - does that help?

KnutAM commented 1 year ago

I played around a bit with this on the cluster login node, and I got much better scaling than on my laptop. Except for the case in the example (old.jl), where the scratch is created inside the timing, it seems to scale quite OK? Or should one expect better scaling efficiency?

The code is in the kam/StaticValues branch, with the benchmarks referred to in the table in the dev folder. There I implemented "static" cellvalues such that isbitstype(cv) == true (in an attempt to see if things would end up in each core's L1/L2 cache, but that didn't seem to work based on the timings).
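Not the actual kam/StaticValues implementation, just a sketch of the idea: if every field of the values struct is itself isbits (e.g. StaticArrays types), the whole struct is isbits and can live on the stack or in registers instead of behind heap pointers.

```julia
using StaticArrays

# Hypothetical, simplified "static" cell values: fixed sizes, no heap-allocated fields.
struct StaticCellValuesSketch{T}
    N::SMatrix{8, 8, T, 64}     # shape function values, (base function, quadrature point)
    dNdξ::SMatrix{8, 8, T, 64}  # reference-coordinate gradients, flattened for the sketch
    w::SVector{8, T}            # quadrature weights
end

isbitstype(StaticCellValuesSketch{Float64})  # true: no references, so no shared heap traffic
```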

| threads | t static.jl [s] | t dyn.jl [s] | t old.jl [s] | t dennis.jl [s] | t·threads static.jl | t·threads dyn.jl | t·threads old.jl | t·threads dennis.jl | loss static.jl | loss dyn.jl | loss old.jl | loss dennis.jl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.219 | 2.998 | 2.118 | 1.899 | 3.2 | 3.0 | 2.1 | 1.9 | 0% | 0% | 0% | 0% |
| 2 | 1.966 | 1.867 | 1.252 | 1.106 | 3.9 | 3.7 | 2.5 | 2.2 | 22% | 25% | 18% | 16% |
| 4 | 1.074 | 0.944 | 0.880 | 0.608 | 4.3 | 3.8 | 3.5 | 2.4 | 33% | 26% | 66% | 28% |
| 8 | 0.564 | 0.557 | 0.615 | 0.340 | 4.5 | 4.5 | 4.9 | 2.7 | 40% | 49% | 132% | 43% |
| 16 | 0.334 | 0.318 | 0.663 | 0.204 | 5.3 | 5.1 | 10.6 | 3.3 | 66% | 70% | 401% | 72% |

(loss = threads*t(threads)/t(1) - 1, where t is the time as a function of number of threads)
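For example, the 401% entry for old.jl at 16 threads comes out as:

```julia
loss(t1, tn, nthreads) = nthreads * tn / t1 - 1

loss(2.118, 0.663, 16)  # ≈ 4.01, i.e. the 401% loss reported for old.jl at 16 threads
```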

versioninfo
```
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 40 virtual cores
```
termi-official commented 1 year ago

Thanks for taking the time, Knut. I agree with you that the scaling is okay, but I would expect that we should be able to get close to 16x for 16 threads here (or, to be realistic, around 14x). The problem is that modern CPU architectures scale up in core count, where the memory bandwidth limit can be quite severe, so we will likely see worse performance hits there than on our workstations. We can artificially drive up scaling with matrix-free techniques and iterative solvers, but then the problem is that we do not have good preconditioners (because most state-of-the-art preconditioners need global matrix information).

KristofferC commented 1 year ago

These numbers should be compared to state-of-the-art linear solvers for the problem, imo. If no one can find a solver where our assembly speed is a bottleneck, then we are "done". Sure, there are matrix-free methods, but then that should actually be what is benchmarked.

termi-official commented 1 year ago

Thanks for the comment here, Kristoffer. I fully agree with you that we should primarily benchmark against similar problems, i.e. assembly. I dug a bit into the literature and ended up at the WorkStream paper (dx.doi.org/10.1145/2851488) again. According to the analysis and experiments therein, we should be able to increase the scalability quite a bit by improving data locality. I will try to reproduce the results from the paper in Ferrite (based on the previous work by Fredrik) if I find some free time.

PetrKryslUCSD commented 1 year ago

Hello all. This thread may be relevant: https://discourse.julialang.org/t/parallel-assembly-of-a-finite-element-sparse-matrix/95947

PetrKryslUCSD commented 1 year ago

@termi-official : I would be curious to have your opinion on the data shown in the above thread.

termi-official commented 1 year ago

I answered in discourse.

KnutAM commented 1 year ago

I have to redo the threading model in FerriteAssembly because I got errors due to multi-layered threading (:static is not OK). This should probably be included in Ferrite.jl's example as well (the branches fe/spawn and fe/workstream are works in progress, I believe; thanks for the inspiration for my attempts, Fredrik).

Somehow, when I'm running the "dennis.jl" benchmark again now on the cluster on Julia 1.9, I get even better results, with the following speedups:

| N | dennis.jl | FerriteAssembly.jl |
|---|-----------|--------------------|
| 1 | 1.000 | 1.000 |
| 2 | 1.965 | 1.888 |
| 4 | 3.895 | 3.727 |
| 8 | 7.655 | 7.359 |
| 16 | 13.455 | 14.350 |

(The current implementation in the example stalls at 8 threads with no further improvement, but there the scratches are created inside the timing.)

I haven't checked whether this is due to the new Julia version or other changes. (This was when submitting jobs to the cluster, which gave the most stable results.)

> I agree with you that the scaling is okay, but I would expect that we should be able to get close to 16x for 16 threads here (or, to be realistic, around 14x).

So we are there, @termi-official :) But perhaps someone else should verify this on their computer/cluster environment; it might be something strange with my benchmarks.

termi-official commented 1 year ago

Whoa, thanks for the update Knut! I will definitely check after GAMM. :)

Edit: Forgot to answer here: I cannot reproduce Knut's results locally.

termi-official commented 1 year ago

Another thing I had in mind: it might be interesting to explore the use of atomics in assemble! for threaded code, e.g. via an assemble_atomic! variant.

termi-official commented 12 months ago

I just found this one, https://dl.acm.org/doi/full/10.1145/3503925, "On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs", which I leave here for later reference.

termi-official commented 8 months ago

> I have to redo the threading model in FerriteAssembly because I got errors due to multi-layered threading (:static is not OK). This should probably be included in Ferrite.jl's example as well (the branches fe/spawn and fe/workstream are works in progress, I believe; thanks for the inspiration for my attempts, Fredrik).
>
> Somehow, when I'm running the "dennis.jl" benchmark again now on the cluster on Julia 1.9, I get even better results, with the following speedups:
>
> | N | dennis.jl | FerriteAssembly.jl |
> |---|-----------|--------------------|
> | 1 | 1.000 | 1.000 |
> | 2 | 1.965 | 1.888 |
> | 4 | 3.895 | 3.727 |
> | 8 | 7.655 | 7.359 |
> | 16 | 13.455 | 14.350 |
>
> (The current implementation in the example stalls at 8 threads with no further improvement, but there the scratches are created inside the timing.)
>
> I haven't checked whether this is due to the new Julia version or other changes. (This was when submitting jobs to the cluster, which gave the most stable results.)
>
> > I agree with you that the scaling is okay, but I would expect that we should be able to get close to 16x for 16 threads here (or, to be realistic, around 14x).
>
> So we are there, @termi-official :) But perhaps someone else should verify this on their computer/cluster environment; it might be something strange with my benchmarks.

I cannot reproduce the original issue on Julia 1.10 with the code in https://github.com/Ferrite-FEM/Ferrite.jl/issues/526#issuecomment-1398588359, and with 1.10 I can reproduce the scaling you reported (a 14x speedup with 16 threads on a 16-core machine). :sweat_smile:

PetrKryslUCSD commented 8 months ago

@termi-official @KnutAM Hi, I'd like to test-drive the thread-based assembly in Ferrite. What is the correct way, please? Thanks! Petr

termi-official commented 8 months ago

I think the easiest way is to create an empty project and add the code from https://github.com/Ferrite-FEM/Ferrite.jl/issues/526#issuecomment-1398588359 to some file (e.g. threaded-assembly-ferrite.jl). In this project you need to add

]add Ferrite#master, SparseArrays, LIKWID, ThreadPinning

and now you should be able to run the program via julia and collect the timings. Note that the scalability is still not great for larger systems, because the coloring approach is rather limited (as pointed out by Kristoffer). If you want a better-scaling solution in Ferrite, you might need to wait a bit longer, as I currently do not have the time to bring the scalable-assembly PR over the finish line soon. However, the difference will be a bit smaller on systems with a small number of cores.
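Concretely, one possible setup could look like this (a sketch; the file name is the example name from above, and the thread count is just an example):

```julia
# Run from an empty project directory; assumes the code linked above was saved
# as threaded-assembly-ferrite.jl in the same directory.
using Pkg
Pkg.activate(".")
Pkg.add(["SparseArrays", "LIKWID", "ThreadPinning"])
Pkg.add(name = "Ferrite", rev = "master")
# Then, from the shell:
#   julia --project=. --threads 16 threaded-assembly-ferrite.jl
```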

PetrKryslUCSD commented 8 months ago

Thanks. I was actually interested in your best approach reported above. Will that not work? Or is that the code in #526?

koehlerson commented 8 months ago

But we do have two different colorings, one of which is from a Bangerth paper, implemented by @fredrikekre.

PetrKryslUCSD commented 8 months ago

> But we do have two different colorings, one of which is from a Bangerth paper, implemented by @fredrikekre.

Sorry, I do not follow. If I want to reproduce your results discussed in this thread (where you say you are happy with the speedups), what do I do? Do I follow this recipe https://github.com/Ferrite-FEM/Ferrite.jl/issues/526#issuecomment-1946832515?

termi-official commented 8 months ago

> Thanks. I was actually interested in your best approach reported above. Will that not work? Or is that the code in #526?

No problem. On Julia 1.10 I get the expected scaling on my machine (14x on a 16-core machine) with the code in the comment; previously it was even below 10x. Or what do you mean by "Will that not work?"

> But we do have two different colorings, one of which is from a Bangerth paper, implemented by @fredrikekre.

Indeed. And in the paper the authors also show that their approach, which we use here, comes with limited scalability; see Figs. 3 and 4. But that is okay for now.

PetrKryslUCSD commented 8 months ago

> If you want a better-scaling solution in Ferrite

Thanks, I was just a bit unsure what the above meant.

PetrKryslUCSD commented 8 months ago

> I think the easiest way is to create an empty project and add the code from #526 (comment) to some file (e.g. threaded-assembly-ferrite.jl). In this project you need to add
>
> ]add Ferrite#master, SparseArrays, LIKWID, ThreadPinning
>
> and now you should be able to run the program via julia and collect the timings. Note that the scalability is still not great for larger systems, because the coloring approach is rather limited (as pointed out by Kristoffer). If you want a better-scaling solution in Ferrite, you might need to wait a bit longer, as I currently do not have the time to bring the scalable-assembly PR over the finish line soon. However, the difference will be a bit smaller on systems with a small number of cores.

Do you remember the version it used to work with? I am getting these errors, which seem to indicate that the Ferrite version I have is too far ahead:

ERROR: LoadError: The `CellVectorValues` interface has been reworked for Ferrite.jl 0.4.0:

 - `CellScalarValues` and `CellVectorValues` have been merged into a single type: `CellValues`
 - "Vectorization" of (scalar) interpolations should now be done on the interpolation
   instead of implicitly in the `CellValues` constructor.

Upgrade as follows:
 - Scalar fields: Replace usage of
       CellScalarValues(quad_rule, interpolation)
   with
       CellValues(quad_rule, interpolation)
 - Vector fields: Replace usage of
       CellVectorValues(quad_rule, interpolation)
   with
       CellValues(quad_rule, interpolation^dim)
   where `dim` is the dimension to vectorize to.

See CHANGELOG.md (https://github.com/Ferrite-FEM/Ferrite.jl/blob/master/CHANGELOG.md) for more details.
PetrKryslUCSD commented 8 months ago

I got it running with 0.3.11. Not sure if that's the right thing to do.

termi-official commented 8 months ago

Ah, sorry. Indeed I have a local modification to let the example run on master.

using Ferrite, SparseArrays, LIKWID
using Statistics: mean # `mean` is used in the timing summary at the end of this script

using ThreadPinning
pinthreads(:cores)

function create_colored_cantilever_grid(celltype, n)
    grid = generate_grid(celltype, (10*n, n, n), Vec{3}((0.0, 0.0, 0.0)), Vec{3}((10.0, 1.0, 1.0)))
    colors = create_coloring(grid)
    return grid, colors
end;

# #### DofHandler
function create_dofhandler(grid::Grid{dim}) where {dim}
    dh = DofHandler(grid)
    push!(dh, :u, dim) # Add a displacement field
    close!(dh)
end;

# ### Stiffness tensor for linear elasticity
function create_stiffness(::Val{dim}) where {dim}
    E = 200e9
    ν = 0.3
    λ = E*ν / ((1+ν) * (1 - 2ν))
    μ = E / (2(1+ν))
    δ(i,j) = i == j ? 1.0 : 0.0
    g(i,j,k,l) = λ*δ(i,j)*δ(k,l) + μ*(δ(i,k)*δ(j,l) + δ(i,l)*δ(j,k))
    C = SymmetricTensor{4, dim}(g);
    return C
end;

# ## Threaded data structures
#
# ScratchValues is a thread-local collection of data that each thread needs to own,
# since we need to be able to mutate the data in the threads independently
struct ScratchValues{T, CV <: CellValues, FV <: FaceValues, TT <: AbstractTensor, dim, Ti}
    Ke::Matrix{T}
    fe::Vector{T}
    cellvalues::CV
    facevalues::FV
    global_dofs::Vector{Int}
    ɛ::Vector{TT}
    coordinates::Vector{Vec{dim, T}}
    assembler::Ferrite.AssemblerSparsityPattern{T, Ti}
end;

# Each thread need its own CellValues and FaceValues (although, for this example we don't use
# the FaceValues)
function create_values(refshape, dim, order::Int)
    ## Interpolations and values
    interpolation_space = Lagrange{refshape, 1}()
    quadrature_rule = QuadratureRule{refshape}(order)
    face_quadrature_rule = FaceQuadratureRule{refshape}(order)
    cellvalues = [CellValues(quadrature_rule, interpolation_space^dim) for i in 1:Threads.nthreads()];
    facevalues = [FaceValues(face_quadrature_rule, interpolation_space^dim) for i in 1:Threads.nthreads()];
    return cellvalues, facevalues
end;

# Create a `ScratchValues` for each thread with the thread local data
function create_scratchvalues(K, f, dh::DofHandler{dim}) where {dim}
    nthreads = Threads.nthreads()
    assemblers = [start_assemble(K, f) for i in 1:nthreads]
    cellvalues, facevalues = create_values(RefHexahedron, dim, 2)

    n_basefuncs = getnbasefunctions(cellvalues[1])
    global_dofs = [zeros(Int, ndofs_per_cell(dh)) for i in 1:nthreads]

    fes = [zeros(n_basefuncs) for i in 1:nthreads] # Local force vector
    Kes = [zeros(n_basefuncs, n_basefuncs) for i in 1:nthreads]

    ɛs = [[zero(SymmetricTensor{2, dim}) for i in 1:n_basefuncs] for i in 1:nthreads]

    coordinates = [[zero(Vec{dim}) for i in 1:length(dh.grid.cells[1].nodes)] for i in 1:nthreads]

    return [ScratchValues(Kes[i], fes[i], cellvalues[i], facevalues[i], global_dofs[i],
                         ɛs[i], coordinates[i], assemblers[i]) for i in 1:nthreads]
end;

# ## Threaded assemble

# The assembly function loops over each color and does a threaded assembly for that color
function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, f::Vector{Float64}, scratches::Vector{SV}, b::Vec{dim}) where {dim, SV}
    ## Each color is safe to assemble threaded
    for color in colors
        ## We try to equipartition the array to increase load per task.
        chunk_size = max(1, 1 + length(color) ÷ Threads.nthreads())
        color_partition = [color[i:min(i + chunk_size - 1, end)] for i in 1:chunk_size:length(color)]
        ## Now we should have a 1:1 correspondence between tasks and elements to assemble.
        Threads.@threads :static for i in 1:length(color_partition)
            for cellid ∈ color_partition[i]
                assemble_cell!(scratches[i], cellid, K, grid, dh, C, b)
            end
        end
    end

    return K, f
end

# The cell assembly function is written the same way as if it was a single threaded example.
# The only difference is that we unpack the variables from our `scratch`.
function assemble_cell!(scratch::ScratchValues, cell::Int, K::SparseMatrixCSC,
                        grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}, b::Vec{dim}) where {dim}

    ## Unpack our stuff from the scratch
    Ke, fe, cellvalues, facevalues, global_dofs, ɛ, coordinates, assembler =
         scratch.Ke, scratch.fe, scratch.cellvalues, scratch.facevalues,
         scratch.global_dofs, scratch.ɛ, scratch.coordinates, scratch.assembler

    fill!(Ke, 0)
    fill!(fe, 0)

    n_basefuncs = getnbasefunctions(cellvalues)

    ## Fill up the coordinates
    nodeids = grid.cells[cell].nodes
    for j in 1:length(coordinates)
        coordinates[j] = grid.nodes[nodeids[j]].x
    end

    reinit!(cellvalues, coordinates)

    for q_point in 1:getnquadpoints(cellvalues)
        for i in 1:n_basefuncs
            ɛ[i] = symmetric(shape_gradient(cellvalues, q_point, i))
        end
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            fe[i] += (δu ⋅ b) * dΩ
            ɛC = ɛ[i] ⊡ C
            for j in 1:n_basefuncs
                Ke[i, j] += (ɛC ⊡ ɛ[j]) * dΩ
            end
        end
    end

    celldofs!(global_dofs, dh, cell)
    assemble!(assembler, global_dofs, fe, Ke)
end;

function run_assemble()
    refshape = RefHexahedron
    dim = 3
    n = 30
    grid, colors = create_colored_cantilever_grid(Hexahedron, n);
    dh = create_dofhandler(grid);

    K = create_sparsity_pattern(dh);
    C = create_stiffness(Val{3}());
    f = zeros(ndofs(dh))
    scratches = create_scratchvalues(K, f, dh)
    b = Vec{3}((0.0, 0.0, 0.0))
    ## compilation
    doassemble(K, colors, grid, dh, C, f, scratches, b);
    return @perfmon ("FLOPS_DP", "BRANCH", "L2", "L3", "CACHE", "ICACHE", "TLB") doassemble(K, colors, grid, dh, C, f, scratches, b);
    # return doassemble(K, colors, grid, dh, C, f, scratches, b);
end

metrics, events = run_assemble();
clocks = [res["Clock [MHz]"] for res in metrics["FLOPS_DP"]]
println("Clock [MHz] (min, avg, max): ", minimum(clocks), " | ", mean(clocks), " | " , maximum(clocks)) 

thread_times = [res["Runtime unhalted [s]"] for res in metrics["FLOPS_DP"]]
println("Runtime unhalted [s] (min, avg, max): ", minimum(thread_times), " | ", mean(thread_times), " | " , maximum(thread_times))

println("Total runtime [s] ", first([res["Runtime (RDTSC) [s]"] for res in metrics["FLOPS_DP"]]))

10.540680736238489 s / 0.750866197848292 s ≈ 14x

PetrKryslUCSD commented 8 months ago

Cool, I will check it out. Thanks, @termi-official !

PetrKryslUCSD commented 8 months ago

I ran the simulations on Mac 2 Ultra (no thread pinning).

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores

Results with 1, 2, 4, 8, 16 threads:

julia> 12.68 ./ [6.68, 3.61, 1.97, 1.06]
4-element Vector{Float64}:
 1.89820e+00
 3.51247e+00
 6.43655e+00
 1.19623e+01

How did you guys figure the serial execution time?

I took the serial execution from the result with a single thread. However, that is not quite right: the serial execution does not need the scratches, for instance. It doesn't need the colors either. So if the serial execution time goes down, the parallel efficiency will go down, by necessity.
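
For what it is worth, here is the bookkeeping I have in mind, as a small sketch (speedup and efficiency are just illustrative helpers; the reference time is the single-thread number from above, which is exactly the debatable choice):

speedup(t_ref, t_p) = t_ref / t_p
efficiency(t_ref, t_p, p) = speedup(t_ref, t_p) / p

t_ref = 12.68  # single-thread assembly time reported above
for (p, t_p) in zip((2, 4, 8, 16), (6.68, 3.61, 1.97, 1.06))
    println(p, " threads: speedup = ", round(speedup(t_ref, t_p), digits = 2),
            ", efficiency = ", round(efficiency(t_ref, t_p, p), digits = 2))
end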

KnutAM commented 8 months ago

Not sure how relevant FerriteAssembly is, but since you tagged me: I see quite some difference using this script, which assembles a similar case. (It requires Ferrite#master and adding the unregistered packages FerriteAssembly.jl and MaterialModelsBase.jl). This uses a bit different threading strategy with smaller sized chunks.

I don't count the difference in setup time (e.g. creating coloring and scratches) since the assembly is usually performed so many times. Based on e.g. #631, I guess that the loop order (which gets messed up for a colored grid) could be partly to blame for the difference between a serial and threaded setup.

Threads time [s] Scaling
1(serial) 9.74862 1
1(threaded) 10.32795 0.943906
2 5.467666 1.782958
4 2.77263 3.516019
8 1.392804 6.999275
16 0.713087 13.67102
termi-official commented 8 months ago

I ran the simulations on Mac 2 Ultra (no thread pinning).

Doesn't thread pinning give you some extra performance?

How did you guys figure the serial execution time?

I took the serial execution from the result with a single thread. However, that is not quite right: the serial execution does not need the scratches, for instance. It doesn't need the colors either. So if the serial execution time goes down, the parallel efficiency will go down, by necessity.

Since I was interested in scalability here, I took one thread as the reference. But it should not matter, as only the assembly loop is measured. For the total program runtime the efficiency will indeed change quite a bit, also because everything but the assembly and the solve is serial.

PetrKryslUCSD commented 8 months ago

Doesn't thread pinning give you some extra performance?

Not available on the mac.

PetrKryslUCSD commented 8 months ago

FinEtools threaded assembly. n=50.

pkrysl@samadira exp % ../julia-1.10.0/bin/julia -t 24 batch_def.jl 50
Current folder: /Users/pkrysl/exp
  Activating project at `~/exp`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
   Resolving package versions...
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
    Updating registry at `~/.julia/registries/General.toml`
  No Changes to `~/exp/Project.toml`
  No Changes to `~/exp/Manifest.toml`
Threads.nthreads() = 24
┌ Warning: Operating system not supported by ThreadPinning.jl. Functions like `pinthreads` will be no-ops!
│ (Hide this warning via `ThreadPinning.Prefs.set_os_warning(false)`.)
└ @ ThreadPinning ~/.julia/packages/ThreadPinning/q7w4s/src/ThreadPinning.jl:93
N = parse(Int, ARGS[1]) = 50
[ Info: Serial Assembly
  0.000011 seconds (3 allocations: 288 bytes)
  0.000000 seconds
 16.624546 seconds (15.34 k allocations: 31.782 GiB, 1.76% gc time, 0.32% compilation time)
[ Info: Parallel Assembly with 1 Tasks
  0.003069 seconds (9 allocations: 85.831 MiB, 6.50% gc time)
  0.036835 seconds (112.54 k allocations: 16.100 GiB, 23.27% gc time, 76.08% compilation time)
  0.000001 seconds
  0.006946 seconds (6.06 k allocations: 406.320 KiB, 98.95% compilation time)
 13.250371 seconds (4.76 M allocations: 345.755 MiB, 0.21% gc time, 3.66% compilation time)
  7.361293 seconds (35 allocations: 20.829 GiB, 4.68% gc time)
[ Info: Serial Assembly
  0.000001 seconds (3 allocations: 288 bytes)
  0.000000 seconds
 16.702665 seconds (315 allocations: 31.781 GiB, 1.67% gc time)
[ Info: Parallel Assembly with 1 Tasks
  0.008336 seconds (9 allocations: 85.831 MiB, 2.72% gc time)
  0.007859 seconds (21 allocations: 16.093 GiB, 98.64% gc time)
  0.000001 seconds
  0.000026 seconds (31 allocations: 1.250 KiB)
 12.822084 seconds (293 allocations: 29.847 MiB)
  7.406948 seconds (35 allocations: 20.829 GiB, 4.91% gc time)
[ Info: Parallel Assembly with 2 Tasks
  0.019266 seconds (17 allocations: 85.832 MiB, 45.63% gc time)
  0.044106 seconds (22 allocations: 16.093 GiB, 99.68% gc time)
  0.000001 seconds
  0.000022 seconds (40 allocations: 1.625 KiB)
  6.823925 seconds (577 allocations: 59.694 MiB)
  7.490230 seconds (35 allocations: 20.829 GiB, 5.18% gc time)
[ Info: Parallel Assembly with 4 Tasks
  0.019106 seconds (33 allocations: 85.832 MiB, 44.28% gc time)
  0.045414 seconds (24 allocations: 16.093 GiB, 99.70% gc time)
  0.000001 seconds
  0.000024 seconds (58 allocations: 2.328 KiB)
  3.699887 seconds (1.15 k allocations: 119.387 MiB)
  8.023389 seconds (35 allocations: 20.829 GiB, 10.19% gc time)
[ Info: Parallel Assembly with 8 Tasks
  0.017301 seconds (65 allocations: 85.834 MiB, 53.32% gc time)
  0.045189 seconds (28 allocations: 16.093 GiB, 99.71% gc time)
  0.000001 seconds
  0.000024 seconds (94 allocations: 3.734 KiB)
  1.954882 seconds (2.28 k allocations: 238.773 MiB)
  8.035546 seconds (35 allocations: 20.829 GiB, 10.06% gc time)
[ Info: Parallel Assembly with 16 Tasks
  0.017221 seconds (130 allocations: 85.838 MiB, 50.30% gc time)
  0.044944 seconds (36 allocations: 16.093 GiB, 99.70% gc time)
  0.000001 seconds
  0.000033 seconds (167 allocations: 6.875 KiB)
  1.503361 seconds (4.55 k allocations: 477.547 MiB)
  7.826032 seconds (35 allocations: 20.829 GiB, 7.49% gc time)

The code:

println("Current folder: $(pwd())")
using Pkg
Pkg.activate(".")
Pkg.instantiate()
Pkg.add("FinEtools")
Pkg.add("FinEtoolsDeforLinear")
Pkg.add("ChunkSplitters")
Pkg.add("ThreadPinning")
Pkg.update()

module par_assembly_examples
using FinEtools
using FinEtoolsDeforLinear
using ChunkSplitters

function run_example(N = 10, ntasks = 2, do_serial = false)
    E = 1000.0
    nu = 0.4999 #Taylor data
    W = 25.0
    H = 50.0
    L = 50.0
    CTE = 0.0

    fens, fes = H8block(W, L, H, N, N, 10*N)

    MR = DeforModelRed3D
    material = MatDeforElastIso(MR, 0.0, E, nu, CTE)
    ir = GaussRule(3, 2)

    geom = NodalField(fens.xyz)
    u = NodalField(zeros(size(fens.xyz, 1), 3)) # displacement field

    numberdofs!(u)

    if do_serial
        @info "Serial Assembly"
        @time femm = FEMMDeforLinear(MR, IntegDomain(fes, ir), material)
        @time associategeometry!(femm, geom)
        @time K = stiffness(femm, geom, u)
    end

    @info "Parallel Assembly with $(ntasks) Tasks"

    function matrixcomputation!(femm, assembler)
        stiffness(femm, assembler, geom, u)
    end

    femms = FEMMDeforLinear[]
    @time for (ch, j) in chunks(1:count(fes), ntasks)
        femm = FEMMDeforLinear(MR, IntegDomain(subset(fes, ch), ir), material)
        associategeometry!(femm, geom)
        push!(femms, femm)
    end

    @time assembler = make_assembler(femms, SysmatAssemblerSparseSymm, u)
    @time start_assembler!(assembler)
    @time assemblers = make_task_assemblers(femms, assembler, SysmatAssemblerSparseSymm, u)
    @time parallel_matrix_assembly(femms, assemblers, matrixcomputation!)
    @time K = make_matrix!(assembler)

    true
end # run_example

end # module

@show Threads.nthreads()
using ThreadPinning
pinthreads(:cores)

@show N = parse(Int, ARGS[1])

using .Main.par_assembly_examples;

Main.par_assembly_examples.run_example(N, 1, true)
Main.par_assembly_examples.run_example(N, 1, true)

for ntasks in [2, 4, 8, 16]
    Main.par_assembly_examples.run_example(N, ntasks, false)
end

nothing

I think it would be interesting if you were to run this on your machine. I believe you have a Linux system with 64 cores?

On the mac 2 ultra I got the speedups

julia> 12.83 ./ [6.82, 3.69, 1.95, 1.50]
4-element Vector{Float64}:
 1.88123e+00
 3.47696e+00
 6.57949e+00
 8.55333e+00

julia> 

which isn't that great for 16 threads.

PetrKryslUCSD commented 8 months ago

Not sure how relevant FerriteAssembly is, but since you tagged me: I see quite some difference using this script, which assembles a similar case. (It requires Ferrite#master and adding the unregistered packages FerriteAssembly.jl and MaterialModelsBase.jl). This uses a bit different threading strategy with unevenly sized chunks.

I don't count the difference in setup time (e.g. creating coloring and scratches) since the assembly is usually performed so many times. Based on e.g. #631, I guess that the loop order (which gets messed up for a colored grid) could be partly to blame for the difference between a serial and threaded setup.

Threads time [s] Scaling
1(serial) 9.74862 1
1(threaded) 10.32795 0.943906
2 5.467666 1.782958
4 2.77263 3.516019
8 1.392804 6.999275
16 0.713087 13.67102

How different is it from the code posted above here (https://github.com/Ferrite-FEM/Ferrite.jl/issues/526#issuecomment-1948025659)?

KnutAM commented 8 months ago

How different is it from the code posted above here (https://github.com/Ferrite-FEM/Ferrite.jl/issues/526#issuecomment-1948025659)?

The main differences are using Threads.@spawn and much smaller chunks (compared to length(color) ÷ Threads.nthreads() which is used above), aiming to give some load balancing.

The loop structure can be seen here, and the heuristic creation of chunks from color here.

Except for the TaskChunks implementation, the strategy is similar to the fe/spawn branch Fredrik created, see here.
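
Conceptually it is something like the following sketch (simplified to one task per chunk rather than the actual TaskChunks machinery; assemble_chunk! is a placeholder for the per-chunk assembly routine and chunksize is an arbitrary example value):

using Base.Threads: @spawn

function assemble_colored_spawned!(assemble_chunk!, colors; chunksize = 64)
    for color in colors
        tasks = Task[]
        # spawn one task per small chunk; the Julia scheduler balances them over the threads
        for start in 1:chunksize:length(color)
            chunk = view(color, start:min(start + chunksize - 1, length(color)))
            push!(tasks, @spawn assemble_chunk!(chunk))
        end
        foreach(wait, tasks)  # all chunks of a color must finish before the next color starts
    end
end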

PetrKryslUCSD commented 8 months ago

Comparison of performance using Ferrite and FinEtools for the cantilever problem. Timings for the assembly only.

Mac 2 Ultra, Julia 1.10.

Mesh with n=50.

Number of tasks Ferrite FinEtools
1 12.72 11.88
2 6.68 6.34
4 3.62 3.34
8 1.99 1.73
16 0.99 0.91

Mesh with n=90.

Number of tasks Ferrite FinEtools
1 74.92 72.36
2 39.13 37.60
4 21.04 20.00
8 11.99 9.93
16 6.21 5.24
PetrKryslUCSD commented 8 months ago

Not sure how relevant FerriteAssembly is, but since you tagged me: I see quite some difference using this script, which assembles a similar case. (It requires Ferrite#master and adding the unregistered packages FerriteAssembly.jl and MaterialModelsBase.jl). This uses a bit different threading strategy with unevenly sized chunks.

I tried adding the masters:

Pkg.add(url="https://github.com/Ferrite-FEM/Ferrite.jl.git#master")
Pkg.add(url="https://github.com/KnutAM/FerriteAssembly.jl.git#master")
Pkg.add(url="https://github.com/KnutAM/MaterialModelsBase.jl.git#master")

Unfortunately, I am getting an error (see the attached screenshot).

What are the versions I should use? Thanks.

KnutAM commented 8 months ago
pkg> add Ferrite#master
pkg> add https://github.com/KnutAM/MaterialModelsBase.jl.git
pkg> add https://github.com/KnutAM/FerriteAssembly.jl.git

Should hopefully work

PetrKryslUCSD commented 8 months ago

Goody, I just had those two switched. Now it worked.

PetrKryslUCSD commented 8 months ago

Impressive speedups, @KnutAM !

julia> 3.65 ./ [1.83, 0.94, 0.51, 0.27]    
4-element Vector{Float64}:
 1.99454e+00
 3.88298e+00
 7.15686e+00
 1.35185e+01

Sorry: I made an error setting up the grid. Rerunning now. New measurements:

pkrysl@samadira new_ferrite % tail bench_thread_ferrite_assembly.txt
1 17.414321708 17.413865375 17.416540792
2 8.54891225 8.545035375 8.561896166
4 4.388465875 4.385740916 4.394549167
8 2.349980041 2.357513333 2.317562125
16 1.244327208 1.243291333 1.239926916

Overall slower than https://github.com/Ferrite-FEM/Ferrite.jl/issues/526#issuecomment-1948025659, but excellent speedup (17.41/1.24=1.40403e+01)!

KnutAM commented 8 months ago

Thanks for trying it out @PetrKryslUCSD! The latter timing series makes more sense, yes! I suspect the difference is due to assembling the internal force vector and the stiffness. An updated test case implements the equivalent element routine, which at least runs faster on my system.

High performance on FinEtools - nicely done! I'll try to run that later! Do I understand the code correctly that I should consider the time for parallel_matrix_assembly + make_matrix! for comparison with the Ferrite times?

PetrKryslUCSD commented 8 months ago

This is my script:

using Pkg
Pkg.activate("."); Pkg.instantiate()

Pkg.add(url="https://github.com/Ferrite-FEM/Ferrite.jl.git#master")
Pkg.add(url="https://github.com/KnutAM/MaterialModelsBase.jl.git")
Pkg.add(url="https://github.com/KnutAM/FerriteAssembly.jl.git")

Pkg.add("ThreadPinning")

using ThreadPinning
pinthreads(:cores)

using Ferrite
using FerriteAssembly
import FerriteAssembly.ExampleElements: LinearElastic

threading = Threads.nthreads() > 0
n = 50
grid = generate_grid(Hexahedron, (10*n,n,n))

ip = Lagrange{RefHexahedron,1}()^3
qr = QuadratureRule{RefHexahedron}(2)
dh = DofHandler(grid)
add!(dh, :u, ip)
close!(dh)
cv = CellValues(qr, ip)
material = LinearElastic(E=200e9, ν=0.3)
buffer = setup_domainbuffer(DomainSpec(dh, material, cv); threading)

K = create_sparsity_pattern(dh)
r = zeros(ndofs(dh))
assembler = start_assemble(K, r)

work!(assembler, buffer)

function dotiming(K, r, buffer)
    assembler = start_assemble(K, r)
    @elapsed work!(assembler, buffer)
end

open("bench_thread_ferrite_assembly.txt"; append=true) do fid
    write(fid, string(Threads.nthreads()))
    write(fid, " ")
    for _ in 1:3
        t = dotiming(K, r, buffer)
        write(fid, string(t))
        write(fid, " ")
    end
    write(fid, "\n")
end
PetrKryslUCSD commented 8 months ago

The newest FinEtools script:

println("Current folder: $(pwd())")
using Pkg
Pkg.activate(".")
Pkg.instantiate()
Pkg.add(url="https://github.com/PetrKryslUCSD/FinEtools.jl.git#main")
Pkg.add(url="https://github.com/PetrKryslUCSD/FinEtoolsDeforLinear.jl.git")
Pkg.add("ChunkSplitters")
Pkg.add("SymRCM")
Pkg.add("ThreadPinning")
Pkg.update()
Pkg.status()

module par_assembly_examples
using FinEtools
using FinEtoolsDeforLinear
using ChunkSplitters
using SymRCM

function run_example(N = 10, ntasks = 2, do_serial = false)
    E = 1000.0
    nu = 0.4999 #Taylor data
    W = 25.0
    H = 50.0
    L = 50.0
    CTE = 0.0

    fens, fes = H8block(W, L, H, N, N, 10*N)
    #C = connectionmatrix(FEMMBase(IntegDomain(fes, GaussRule(3, 1))), count(fens))
    #ordering = symrcm(C)
    #fens, fes = reordermesh(fens, fes, ordering)

    MR = DeforModelRed3D
    material = MatDeforElastIso(MR, 0.0, E, nu, CTE)
    ir = GaussRule(3, 2)

    geom = NodalField(fens.xyz)
    u = NodalField(zeros(size(fens.xyz, 1), 3)) # displacement field

    numberdofs!(u)

    if do_serial
        @info "Serial Assembly"
        @time femm = FEMMDeforLinear(MR, IntegDomain(fes, ir), material)
        @time associategeometry!(femm, geom)
        @time K = stiffness(femm, geom, u)
    end

    @info "Parallel Assembly with $(ntasks) Tasks"

    function matrixcomputation!(femm, assembler)
        stiffness(femm, assembler, geom, u)
    end

    femms = FEMMDeforLinear[]
    @time for (ch, j) in chunks(1:count(fes), ntasks)
        femm = FEMMDeforLinear(MR, IntegDomain(subset(fes, ch), ir), material)
        associategeometry!(femm, geom)
        push!(femms, femm)
    end

    @time assembler = make_assembler(femms, SysmatAssemblerSparseSymm, u)
    @time start_assembler!(assembler)
    @time assemblers = make_task_assemblers(femms, assembler, SysmatAssemblerSparseSymm, u)
    @time parallel_matrix_assembly(femms, assemblers, matrixcomputation!)
    @time K = make_matrix!(assembler)

    true
end # run_example

end # module

@show Threads.nthreads()
using ThreadPinning
pinthreads(:cores)

@show N = parse(Int, ARGS[1])
@show ntasks = parse(Int, ARGS[2])

using .Main.par_assembly_examples;

Main.par_assembly_examples.run_example(N, 1, true)
Main.par_assembly_examples.run_example(N, 1, true)

Main.par_assembly_examples.run_example(N, ntasks, false)
Main.par_assembly_examples.run_example(N, ntasks, false)

nothing
PetrKryslUCSD commented 8 months ago

@termi-official and @KnutAM : Is there enough novelty in your investigations of parallel matrix assembly for a paper? If so, would you be interested in writing something? Let me know your thoughts please at pkrysl@ucsd.edu. I look forward to it. P

PetrKryslUCSD commented 8 months ago

Do I understand the code correctly that I should consider the time for parallel_matrix_assembly + make_matrix! for comparison with the Ferrite times?

Sorry, I just noticed this. No, this should be equivalent to Ferrite create sparsity pattern + assembly.
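
That is, roughly these two spans (a sketch reusing the variable names from the two scripts above, so it assumes that setup has already been run; it is only meant to show what gets timed on each side):

# Ferrite / FerriteAssembly side: sparsity pattern + assembly
ferrite_assembly_time(dh, r, buffer) = @elapsed begin
    K = create_sparsity_pattern(dh)
    work!(start_assemble(K, r), buffer)
end

# FinEtools side: parallel element assembly + building the sparse matrix
finetools_assembly_time(femms, assemblers, assembler, matrixcomputation!) = @elapsed begin
    parallel_matrix_assembly(femms, assemblers, matrixcomputation!)
    make_matrix!(assembler)
end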