Parallel assembly in JuliaFEM

KristofferC commented 5 years ago

The following is a roadmap/outline for how threaded assembly could start to be implemented in JuliaFEM. The scope is limited to shared memory parallel assembly. It will be continuously updated.

Sample assembly code

In order to make it easier to examplify things, here is some smaple code for an assembly routine taken from JuliaFEM:

function assemble_elements!{B}(problem::Problem{Heat}, assembly::Assembly,
                               elements::Vector{Element{B}}, time::Float64)

    bi = BasisInfo(B)
    ndofs = length(B)
    Ke = zeros(ndofs, ndofs)
    fe = zeros(ndofs)

    for element in elements
        fill!(Ke, 0.0)
        fill!(fe, 0.0)
        for ip in get_integration_points(element)
            J, detJ, N, dN = element_info!(bi, element, ip, time)
            k = element("thermal conductivity", ip, time)
            Ke += ip.weight * k*dN'*dN * detJ
            if haskey(element, "heat source")
                f = element("heat source", ip, time)
                fe += ip.weight * N'*f * detJ
            end
        end

        gdofs = get_gdofs(problem, element)
        add!(assembly.K, gdofs, gdofs, Ke)
        add!(assembly.f, gdofs, fe)
    end
end

Goal

The goal of parallel assembly is, of course, to be able to exploit all threads on the computer to speed up the assembly. Linear solvers are in practice always able to use multiple threads so if the assembly process is serial, that might be a bottleneck in the FEM simulation. There are two main difficulties with parallel assembly, getting it correct and getting it fast.

Correctness

As always with parallel code, one has to be careful not to introduce data races, which means that two threads might write to the same location at the same time. This means that every write operation must be audited to see if there is potential for another thread to write to that location in the same threaded loop. In the assembly rouine above we have writes to two types of data structures which I call "local buffers" and "global buffers"

Local buffers

Local buffers are small enough in size such that allocating a copy of them on each thread is not prohibitivly expensive. In the code above these would be bi, Ke and fe. For a parallell assembly this would be allocated

To simplify things, I would suggest that each problem defines a struct

struct HeatAssemblyBuffers{B}
   bi::BasisInfo{B}
   Ke::Matrix{Float64}
   fe::Vector{Float64}
end

HeatAssembly(B, ndofs::Int) = HeatAssembly(BasisInfo(B), zeros(ndofs, ndofs), zeros(ndofs))

We could then generate buffers for all threads as

[HeatAssemblyBuffers(B, ndofs) for i in 1:Threads.nthreads()]

If each thread uses its own HeatAssemblyBuffer then there should be no data race when writing to local buffers.

Global buffers

Global buffers are data structure that are large enough that we don't want to allocate them in every thread. This include the global stiffness matrix and the global "force" vector. If we do a parallel assembly naively by just threading the loop over the elements the data race is obvious. Two threads working on elements that share degrees of freedom will try to e.g. assemble into the global force vector at the same location at the same time. The solution to this is to make sure that this never happens.

One way of solving the above problem is "mesh coloring". By assigning a "color" to each element such that no element with the same color share degrees of freedom, we can assemble the elements of that color in parallel. We then wait for all the threads to be done before moving on to the next color. There are many algorithms for coloring meshes but to start with, a greedy one that just goes element by element and take the first color that is allowed should be good enough. An implementation of this can be found at https://github.com/KristofferC/JuAFEM.jl/blob/master/src/Grid/coloring.jl.

Another way to solve the problem above is to not assemble into a global structure but have e.g. a COO structure for the global stifness that each thread push into, and then in the end, sum all those together. This is, however, quite likely bad for performance.

Performance

Allocations local buffers

Julia contains a "stop the world" GC which means that every time the GC runs, all threads are stopped and wait for it to finish. If the assembly loop allocates a lot, and we run multiple threads, the GC will have to run quite often, likely removing much of the speedup in using multiple threads. It is therefore very important that the loop over the element doesn't allocate. I would argue, it should allocate 0.

As an exmaple, even if we rewrite the line

Ke += ip.weight * k*dN'*dN * detJ

to the more efficient

Ke .+= (ip.weight * k * detJ) .* (dN' * dN)

this still need to allocate a temporary for (dN' * dN). There are some solutions to this:

Use something like StaticArrays for the matrices which do not allocate.
Use something like Tensors.jl which do not allocate
Preallocate a buffer and use in-place matrix multiplication:
```
mul!(dN_dN, dN', dN)
Ke .+= (ip.weight * k * detJ) .* dN_dN
```

I would suggest trying to do point 1 or 2 because having to allocate so many in-place buffers can make the code hard to read and static data structures fits well for this problem.

Allocations global buffers

Using the coloring technique to assemble directly into a CSC sparse matrix means that there is zero allocation from this assembly step. The same matrix can be used for all time step. Using the method where each thread has its own COO matrix to assemble into, we need to convert this to CSC and add all the contributions together for every assembly step which will likely be expensive and require a lot of memory.

ahojukka5 commented 5 years ago

Would it be possible to implement parallel assembly to a higher level than inside assemble_elements! by calling it multiple times? The function takes a list of elements as input arguments, so in principle, we could call this function for elements of one color at a time. As a proof of concept, we don't even need to implement coloring, we just create two problems not sharing the same nodes.

Here is the point where assembling starts for each problem. So if we divide elements into different colors and implement direct assembly to CSC form, would it work?

for (element_type, elements) in group_by_element_type(elements)
    elements_by_color = group_by_color(elements) # or something similar
    @threads for elements in elements_by_color
        assemble_elements!(problem, assembly, elements, time)
    end
end

KristofferC commented 5 years ago

Doing it at a higher level would be desirable, yes.

In the loop there though, isn't different threads working on different colors at the same time? The threaded loop has to be over a set of elements with the same color. I think a complication now is that each problem is including the element loop in their assembly.

Instead of assemble! being defined as (pseudo code)

function assemble!(problem)
    buf = allocate_local_buffers(problem)
    for ele in elements
        ...
        # assemble for each element
        ...
    end
end

I think it would be better if a problem defined two things, ProblemLocalBuffers(::Problem) and assemble_element!(::Problem, ::ProblemLocalBuffers).

The code you linked would then be written (in parallel) as

for (element_type, elements) in group_by_element_type(elements)
    local_buffers = [ProblemLocalBuffers(problem, element_type) for i in 1:Threads.nthreads()]
    elements_by_color = group_by_color(elements) # or something similar
    for elements_with_one_color in elements_by_color
        Threads.@threads for elements in elements_with_one_color
            assemble_element!(problem, assembly, element, time, local_buffers[Threads.threadid()])
        end
    end
end

KristofferC commented 5 years ago

The structure above is quite similar how the threaded JuAFEM assembly is done:

https://github.com/KristofferC/JuAFEM_Performance_Study/blob/1b67548b8deab2562db141ccddac7ec070548a03/eifel_threaded_juafem.jl#L151-L159

    fill!(K, 0.0)
    fill!(f, 0.0)
    for color in colors
        # Each color is safe to assemble threaded
        Threads.@threads for i in 1:length(color)
            scratchvalue = get_scratchvalue!(scratchvalues, K, f, dh)
            element = color[i]
            assemble_cell!(scratchvalue, element, K, dh, prob_data, quad_data, u)
        end
    end

ahojukka5 commented 5 years ago

First, we actually had a function to assemble only one element at a time, but because we need to allocate workspace to reduce memory usage we go to the strategy of assembling all elements of the same type at a time.

But, we have some situations where we cannot assemble only one element at a time because we need certain information about the geometry around the element (e.g. in contact mechanics find segmentation before assembly). We could store that information to problem.properties, before threaded assembly.

Safest would be if we still can support the way of assembling all elements at once and have something like problem.threaded_assembly :: Bool determining which assembly should be used.

bplcn commented 5 years ago

Hi! I am a user of FEMSparse and it works quite well for single core. But it seems that the package now could not work for multi-threads. I am wondering if it is necessary to modify it by assembling n sparse matrix and adding them together at last.

ahojukka5 commented 5 years ago

What kind of problems are you experiencing? If you create a sparsity pattern first and then assemble parallel using coloring, it should work.

bplcn commented 5 years ago

What kind of problems are you experiencing? If you create a sparsity pattern first and then assemble parallel using coloring, it should work.

I checked my code and fix it. I used to store both the Kgloabl and fglobal in a mutable struct but now I store them in two individual array and there is no problem. Thank you for your reply!

ahojukka5 commented 5 years ago

Ok, great. If you get some results how well is your code scaling when you increase the number of threads, I would be interested to see them.

bplcn commented 5 years ago

Ok, great. If you get some results how well is your code scaling when you increase the number of threads, I would be interested to see them.

I just noticed that it worked because I did not use multi-thread marco :( . Now the problem occurs again: error("some row indices were not found") It is an error message from FEMSparse and the program still cannot work without the error line. I think that it is necessary for me to take consideration of the coloring algorithm you mentioned to accelerate the assembly process with FEMSparse. Anyway, the sequential FEMSparse is still much quicker than function I coded with 4 threads for 160000 2D rectangle elements

bplcn commented 5 years ago

...Finally, I know the reason that it cannot assemble parallelly. I create individual assembler for each thread and add the assembler.K together. However, the add operation for SparseCSC seems to drop the zero data, so in the next iteration the FEMSparse cannot work for the location that has not been defined...I have rewritten a function for the sparse matrix adding and it now seems to work again.

JuliaFEM / JuliaFEM.jl