JuliaLinearAlgebra / AlgebraicMultigrid.jl

Algebraic Multigrid in Julia
Other
116 stars 23 forks source link

Multithreading not working when using OpenBLAS #107

Open ma-sadeghi opened 10 months ago

ma-sadeghi commented 10 months ago

I haven't done a rigorous profile, but eyeballing the CPU usage, it seems that the _solve step is single-threaded. Do you plan on adding multithreading? Thanks!

ma-sadeghi commented 10 months ago

Update 1: After installing MKL, multithreading seems to be working. Maybe this is related: JuliaLang/julia/issues/49455?

ma-sadeghi commented 10 months ago

Update 2: Upon closer looking, even with MKL, it seems that the multithreaded part is only ~5% of the total run time. Here's a screenshot of CPU usage during the _solve phase:

image

termi-official commented 10 months ago

On master there is nothing explicitly threaded in AlgebraicMultigrid.jl. Only for the coarse solver you can expect that it might be multithreaded. I will try to provide threaded smoothers, RAP and setup algorithms in the future. Right now I don't have enough time and hands to implement it. However, if you (or someone else reading this) want to go ahead, then also feel free to grab this and ping me in the PR or here.

ma-sadeghi commented 10 months ago

Thanks! Quick question: Any intuition on which part might be the bottleneck?

termi-official commented 10 months ago

In the default setup it is likely the coarse solver (pinv). In general: It depends on what smoothers, cycle and coarse solver you choose.

ma-sadeghi commented 10 months ago

I'm trying to solve the Laplace equation on a very large system of equations, ~ 1 billion unknowns. I'm willing to spend some time on a PR if there are significant gains to be achieved. I know it's difficult to know in advance without proper profiling the code though.

termi-official commented 10 months ago

If you want to keep to the shared memory environment, then I would suggest that you setup a Laplace problem with fewer dofs first and measure with e.g. https://github.com/KristofferC/TimerOutputs.jl the setup, smoother, RAP and coarse solve timings. Also, if it is possible use a different coarse solver. I think I forgot to add it to docs, so here is an example how to swap the coarse solver

using AlgebraicMultigrid, LinearSolve
ml = ruge_stuben(A, coarse_solver=AlgebraicMultigrid.LinearSolveWrapper(UMFPACKFactorization()))

The amount of speedup you can achieve depends on your specific system at hand (i.e. number of cores, cache sizes, memory bandwidth,...).

termi-official commented 10 months ago

Speeding up RAP should also be straight forward. We can try something in the direction of https://github.com/BacAmorim/ThreadedSparseCSR.jl/blob/main/src/batch_matmul.jl#L13-L34 .

termi-official commented 10 months ago

cc @tirtho109 @fredrikekre is there an update regarding the distributed memory implementation?