bankofcanada / StateSpaceEcon.jl

BSD 3-Clause "New" or "Revised" License

proof of concept: allow using Pardiso as a linear solver #43

Closed · KristofferC closed this 1 year ago

KristofferC commented 1 year ago

Pardiso is a sparse direct linear solver that ships with MKL and can give a significant speed improvement over Julia's default sparse solver (UMFPACK).

This PR is a proof of concept of using Pardiso in the stacked time solver. Running the FRB/US model with noise, it performs around 2x better on my machine than the default:

StateSpaceEcon.StackedTimeSolver.USE_PARDISO_FOR_LU = true:

Section            ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────
factorize           8    3.66s   57.9%   458ms   34.7MiB    1.8%  4.34MiB

StateSpaceEcon.StackedTimeSolver.USE_PARDISO_FOR_LU = false:

Section            ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────
factorize           8    8.43s   82.3%   1.05s   5.01GiB   73.4%   642MiB
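
For reference, a minimal sketch of how the two runs above can be switched, assuming the flag this PR adds stays a plain module-level variable with this name:

using StateSpaceEcon

# assumption: the PR exposes the switch under exactly this name
StateSpaceEcon.StackedTimeSolver.USE_PARDISO_FOR_LU = true   # factorize with MKL Pardiso
# ... run the stacked-time simulation being timed ...
StateSpaceEcon.StackedTimeSolver.USE_PARDISO_FOR_LU = false  # fall back to UMFPACK
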
bbejanov commented 1 year ago

Thanks! It's nice to have this option and we'll definitely include it in StateSpaceEcon.

For our internal use, we'll probably need a commercial license for Pardiso - I don't think it's free and open source. We'll ask our legal services to check that.

KristofferC commented 1 year ago

MKL should be fine to use even for commercial use (https://www.intel.com/content/www/us/en/developer/articles/tool/onemkl-license-faq.html), but please check with someone else :p. It is not open source, though.

KristofferC commented 1 year ago

I want to write a little bit more here about further gains that can be had with Pardiso. I haven't implemented it yet because I still don't have the full picture of how the Jacobian changes during the simulation.

Pardiso has three stages:

  1. Symbolic factorization (requires only the sparsity pattern of the matrix; the actual values in the matrix are unused).
  2. Numerical factorization (uses the numerical values in the matrix).
  3. Solving (uses the right-hand side).

From my timings (see below), step 1 takes about 75% of the time, step 2 about 25%, and step 3 is fast.

So the first step towards efficiency is to redo the numerical factorization only when the numerical values in the matrix have actually been updated. I think the current code already does this.

But! Even if the numerical values in the matrix have changed, if the sparsity pattern is the same, that is, the same indices in the matrix are non-zero, the symbolic factorization is still valid and only steps 2 and 3 have to be redone. Since step 1 is the most expensive one, this can give quite a dramatic speedup. It is quite common for the sparsity pattern to stay constant along a simulation, since the "structure" of the problem tends not to change. A sketch of this reuse logic follows.
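
A minimal sketch of that reuse logic, assuming the `PardisoFactorization` struct and the `pardiso_factorize`/`pardiso_solve!` helpers defined further down; `same_pattern` and `refactorize!` are hypothetical names, not code from this PR:

using Pardiso, SparseArrays

# Two sparse matrices share a sparsity pattern iff their structural
# index arrays match exactly.
same_pattern(A::SparseMatrixCSC, B::SparseMatrixCSC) =
    size(A) == size(B) && A.colptr == B.colptr && A.rowval == B.rowval

function refactorize!(pf::PardisoFactorization, Jnew::SparseMatrixCSC)
    # Convert to the layout Pardiso expects (same call as in pardiso_factorize).
    Jnew_p = get_matrix(pf.ps, Jnew, :N)
    if same_pattern(pf.J, Jnew_p)
        # Same pattern: the symbolic factorization (step 1) is still valid.
        # Swap in the new values; the next pardiso_solve! redoes only steps 2 and 3.
        pf.J = Jnew_p
        return pf
    end
    # Pattern changed: redo the analysis from scratch.
    return pardiso_factorize(Jnew)
end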


I stored the matrix in Matrix Market format (https://github.com/JuliaSparse/MatrixMarket.jl) and attached it to this post as a zip file (J.zip) (zipping is required by GitHub). The code below sets up some functions that time the different steps in the solution process, then loads the matrix and times it (in the transcript, `b` is any right-hand-side vector of length `size(J, 1)`).

using Pardiso, SparseArrays

# Holds the solver (which owns the symbolic/numeric factorizations) together
# with the matrix in the layout Pardiso expects.
mutable struct PardisoFactorization
    ps::MKLPardisoSolver
    J::SparseMatrixCSC
end

# See https://github.com/JuliaSparse/Pardiso.jl/blob/master/examples/exampleunsym.jl
function pardiso_factorize(JJ)
    ps = MKLPardisoSolver()
    set_matrixtype!(ps, Pardiso.REAL_NONSYM)
    pardisoinit(ps)
    fix_iparm!(ps, :N)
    # See https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/sparse-solver-routines/onemkl-pardiso-parallel-direct-sparse-solver-iface/pardiso-iparm-parameter.html
    set_iparm!(ps, 2, 2) # the parallel (OpenMP) version of the nested dissection algorithm
    JJ_pardiso = get_matrix(ps, JJ, :N)

    @info "Step 1: Analysis"
    set_phase!(ps, Pardiso.ANALYSIS)
    @time pardiso(ps, JJ_pardiso, Float64[])

    pf = PardisoFactorization(ps, JJ_pardiso)
    # Release Pardiso's internal memory when the object is garbage collected.
    finalizer(pf) do x
        set_phase!(x.ps, Pardiso.RELEASE_ALL)
        pardiso(x.ps)
    end
    return pf
end

function pardiso_solve!(pf::PardisoFactorization, x::AbstractVector)
    ps, J = pf.ps, pf.J

    @info "Step 2: Numerical factorization:"
    set_phase!(ps, Pardiso.NUM_FACT)
    @time pardiso(ps, J, Float64[])

    @info "Step 3: Solve"
    set_phase!(ps, Pardiso.SOLVE_ITERATIVE_REFINE)
    X = similar(x) # the solution is stored in X
    @time pardiso(ps, X, J, x)
    copy!(x, X)
end
julia> using MatrixMarket

julia> J = MatrixMarket.mmread("J.mtx")
95037×95037 SparseMatrixCSC{Float64, Int64} with 473805 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
...

julia> ps = pardiso_factorize(J);
[ Info: Step 1: Analysis
  0.460367 seconds (10 allocations: 4.340 MiB)

julia> pardiso_solve!(ps, copy(b));
[ Info: Step 2: Numerical factorization:
  0.123646 seconds (10 allocations: 4.340 MiB, 2.20% gc time)
[ Info: Step 3: Solve
  0.008820 seconds (9 allocations: 4.703 MiB)
KristofferC commented 1 year ago

I implemented the reuse of the symbolic factorization in the last commit. It is a bit ugly and could use some cleanup, but I mostly wanted to see the performance change, and it is significant:

Before the reuse of the factorization I get:

 ─────────────────────────────────────────────────────────────────────────────
                                     Time                    Allocations      
                            ───────────────────────   ────────────────────────
      Tot / % measured:          5.86s /  99.3%           0.95GiB /  89.6%    

 Section            ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
   stackedtime_RJ        8    5.74s   98.6%   718ms    810MiB   93.3%   101MiB
     factorize           8    4.70s   80.7%   587ms   69.5MiB    8.0%  8.68MiB
     eval_RJ         1.90k    995ms   17.1%   525μs    664MiB   76.5%   358KiB
 ─────────────────────────────────────────────────────────────────────────────

After the reuse I get:

 ───────────────────────────────────────────────────────────────────────────
                                   Time                    Allocations      
                          ───────────────────────   ────────────────────────
     Tot / % measured:         2.71s /  96.9%            936MiB /  88.2%    

 Section          ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────
 stackedtime_RJ        8    2.54s   96.6%   317ms    779MiB   94.4%  97.4MiB
   factorize           8    1.48s   56.2%   185ms   39.1MiB    4.7%  4.88MiB
   eval_RJ         1.90k    1.02s   38.9%   539μs    664MiB   80.4%   358KiB
 ───────────────────────────────────────────────────────────────────────────
bbejanov commented 1 year ago

Thanks @KristofferC, this is a great idea.

I figured out that it is possible to cache the symbolic step and reuse it with UMFPACK as well. So I'll add this, keep your Pardiso version, and integrate the two so users can easily switch between UMFPACK and Pardiso. I'll do that in a separate PR. Thanks again!
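
For reference, the UMFPACK-side reuse can be sketched like this (not the actual follow-up PR code): in Julia, `lu!` on an existing sparse `lu` factorization reuses the symbolic analysis as long as the new matrix has the same sparsity pattern.

using LinearAlgebra, SparseArrays

A = sprand(1000, 1000, 0.01) + I  # example matrix; + I keeps it nonsingular
b = rand(1000)

F = lu(A)              # UMFPACK: symbolic analysis + numerical factorization
x1 = F \ b

nonzeros(A) .*= 2.0    # same sparsity pattern, new numerical values
lu!(F, A)              # reuses the symbolic analysis; redoes only the numeric step
x2 = F \ b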