Closed: KristofferC closed this 1 year ago
Thanks! It's nice to have this option and we'll definitely include it in StateSpaceEcon.
For our internal use, we'll probably need a commercial license for Pardiso - I don't think it's free and open source. We'll ask our legal services to check that.
MKL should be fine even for commercial use (https://www.intel.com/content/www/us/en/developer/articles/tool/onemkl-license-faq.html). But please check with someone else :p. It is not open source, though.
I want to write a little bit more here about further gains that can be had with Pardiso. I haven't implemented it yet because I still don't have the full picture of how the Jacobian changes during the simulation.
Pardiso has three stages:

1. Symbolic factorization (analysis of the sparsity pattern)
2. Numerical factorization
3. Solve (with iterative refinement)
From my timings (see below), step 1 takes about 75% of the total time, step 2 about 25%, and step 3 is fast.
So the first thing needed for efficiency is to redo the numerical factorization only when the numerical values in the matrix have actually been updated. I think the current code already does this.
But! Even if the numerical values in the matrix have changed, if the sparsity pattern is the same, that is, the same indices in the matrix are non-zero, the symbolic factorization is still valid and only steps 2 and 3 have to be redone. Since step 1 is the most expensive one, this can give quite a dramatic speedup. It is quite common for the sparsity pattern to stay constant along a simulation since the "structure" of the problem tends not to change.
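As a quick illustration (not from the PR), checking whether the sparsity pattern is unchanged boils down to comparing the CSC structure arrays of the two matrices; `same_pattern` is a hypothetical helper name of my own:

```julia
using SparseArrays

# Two SparseMatrixCSC matrices have the same sparsity pattern exactly when
# their column pointers and row indices coincide; the stored values (nzval)
# are allowed to differ. When this holds, the symbolic factorization can be
# reused and only the numerical phases need to be redone.
same_pattern(A::SparseMatrixCSC, B::SparseMatrixCSC) =
    size(A) == size(B) && A.colptr == B.colptr && A.rowval == B.rowval
```

A Jacobian update that only changes values (not structure) would then pass `same_pattern(J_old, J_new) == true` and skip the analysis phase.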
I stored the matrix in Matrix Market format (https://github.com/JuliaSparse/MatrixMarket.jl) and attached it to this post as a zip file (J.zip; GitHub requires attachments to be zipped). The code below sets up some functions that time the different steps in the solution process, loads the matrix, and times it.
using Pardiso, SparseArrays

mutable struct PardisoFactorization
    ps::MKLPardisoSolver
    J::SparseMatrixCSC
end
# See https://github.com/JuliaSparse/Pardiso.jl/blob/master/examples/exampleunsym.jl
function pardiso_factorize(JJ)
    ps = MKLPardisoSolver()
    set_matrixtype!(ps, Pardiso.REAL_NONSYM)
    pardisoinit(ps)
    fix_iparm!(ps, :N)
    # See https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/sparse-solver-routines/onemkl-pardiso-parallel-direct-sparse-solver-iface/pardiso-iparm-parameter.html
    set_iparm!(ps, 2, 2) # The parallel (OpenMP) version of the nested dissection algorithm.
    JJ_pardiso = get_matrix(ps, JJ, :N)
    @info "Step 1: Analysis"
    set_phase!(ps, Pardiso.ANALYSIS)
    @time pardiso(ps, JJ_pardiso, Float64[])
    pf = PardisoFactorization(ps, JJ_pardiso)
    finalizer(pf) do x
        # Release Pardiso's internal memory when the wrapper is garbage collected.
        set_phase!(x.ps, Pardiso.RELEASE_ALL)
        pardiso(x.ps)
    end
    return pf
end
function pardiso_solve!(pf::PardisoFactorization, x::AbstractVector)
    ps, J = pf.ps, pf.J
    @info "Step 2: Numerical factorization"
    set_phase!(ps, Pardiso.NUM_FACT)
    @time pardiso(ps, J, Float64[])
    @info "Step 3: Solve"
    set_phase!(ps, Pardiso.SOLVE_ITERATIVE_REFINE)
    X = similar(x) # Solution is stored in X
    @time pardiso(ps, X, J, x)
    copy!(x, X)
end
julia> using MatrixMarket
julia> J = MatrixMarket.mmread("J.mtx")
95037×95037 SparseMatrixCSC{Float64, Int64} with 473805 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
...
julia> ps = pardiso_factorize(J);
[ Info: Step 1: Analysis
0.460367 seconds (10 allocations: 4.340 MiB)
julia> pardiso_solve!(ps, copy(b));
[ Info: Step 2: Numerical factorization:
0.123646 seconds (10 allocations: 4.340 MiB, 2.20% gc time)
[ Info: Step 3: Solve
0.008820 seconds (9 allocations: 4.703 MiB)
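To make the reuse concrete, a refactorization helper could redo only phases 2 and 3 when the new Jacobian has the same sparsity pattern as the analyzed one, and fall back to a full analysis otherwise. This is a hypothetical sketch (the name `pardiso_refactorize!` is mine, not from the commit), building on the `PardisoFactorization` struct above:

```julia
using Pardiso, SparseArrays

# Hypothetical helper: reuse the symbolic factorization stored in `pf`
# whenever the new Jacobian's sparsity pattern matches; otherwise redo
# the (expensive) analysis phase before the numerical factorization.
function pardiso_refactorize!(pf::PardisoFactorization, Jnew::SparseMatrixCSC)
    JJ = get_matrix(pf.ps, Jnew, :N)
    if JJ.colptr == pf.J.colptr && JJ.rowval == pf.J.rowval
        copyto!(pf.J.nzval, JJ.nzval)       # values changed, pattern did not
    else
        pf.J = JJ                           # pattern changed: redo symbolic step
        set_phase!(pf.ps, Pardiso.ANALYSIS)
        pardiso(pf.ps, pf.J, Float64[])
    end
    set_phase!(pf.ps, Pardiso.NUM_FACT)
    pardiso(pf.ps, pf.J, Float64[])
    return pf
end
```

With this in place, each Newton iteration would call `pardiso_refactorize!` followed by the solve phase, paying for the analysis only when the structure actually changes.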
I implemented reuse of the symbolic factorization in the last commit. It is a bit ugly and could use some cleanup, but I mostly wanted to see the performance change, and it is significant:
Before the reuse of the factorization I get:
─────────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 5.86s / 99.3% 0.95GiB / 89.6%
Section ncalls time %tot avg alloc %tot avg
─────────────────────────────────────────────────────────────────────────────
stackedtime_RJ 8 5.74s 98.6% 718ms 810MiB 93.3% 101MiB
factorize 8 4.70s 80.7% 587ms 69.5MiB 8.0% 8.68MiB
eval_RJ 1.90k 995ms 17.1% 525μs 664MiB 76.5% 358KiB
─────────────────────────────────────────────────────────────────────────────
After the reuse I get:
───────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 2.71s / 96.9% 936MiB / 88.2%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────────
stackedtime_RJ 8 2.54s 96.6% 317ms 779MiB 94.4% 97.4MiB
factorize 8 1.48s 56.2% 185ms 39.1MiB 4.7% 4.88MiB
eval_RJ 1.90k 1.02s 38.9% 539μs 664MiB 80.4% 358KiB
───────────────────────────────────────────────────────────────────────────
Thanks @KristofferC, this is a great idea.
I figured out that it is possible to cache the symbolic step and reuse it with UMFPACK as well. So I'll add this, keep your Pardiso version, and integrate the two so users can easily switch between UMFPACK and Pardiso. I'll do that in a separate PR. Thanks again!
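For reference, the UMFPACK analogue can be sketched with `lu!`, which (on Julia ≥ 1.5) reuses the symbolic analysis of an existing sparse LU factorization when the new matrix has the same sparsity pattern; the small matrices here are just illustrative:

```julia
using LinearAlgebra, SparseArrays

J1 = sparse([1, 2, 2], [1, 1, 2], [4.0, 1.0, 3.0])
F  = lu(J1)             # full factorization: symbolic analysis + numeric phase

# Same sparsity pattern, new numerical values:
J2 = sparse([1, 2, 2], [1, 1, 2], [5.0, 2.0, 6.0])
lu!(F, J2)              # numeric refactorization only; symbolic step is reused

x = F \ [5.0, 14.0]     # solve J2 * x = b with the updated factorization
```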
Pardiso is a sparse linear solver that comes with MKL and can give a significant speed improvement over the default sparse solver in Julia (UMFPACK).
This PR is a proof of concept of using Pardiso in the stacked-time solver. Running the FRB US model with noise, it performs around 2x better on my machine compared to the default: