Just want to summarize some random experiments I've been trying out. For simplicity, I'm doing this completely independently of the code in Jems, using a simpler system that produces a block tri-diagonal matrix filled with ones. The code below shows an example of this, following a structure similar to what we do in Solver.jl.
using BlockBandedMatrices, ForwardDiff, Base.Threads
nz = 1000 # number of zones
nvars = 20 # number of variables
## Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1 # block bandwidths
N = M = nz # number of row/column blocks
cols = rows = [nvars for i = 1:N] # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
# For nz=4, nvars=2, this looks like this
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
## Make a mock set of equations to be solved for each row
function eval_cell_eqs!(y, x, k, nz, nvars)
    # k is the cell for which I am computing the equations
    # y = [y_1,y_2] are the discretized equations I want to solve.
    # x are the independent variables in the current and neighboring cells,
    # for example for k=2 I would have [T1,P1,T2,P2,T3,P3]. For edge cases
    # I only have the relevant subset of variables, eg for k=1, x=[T1,P1,T2,P2]
    for i in 1:nvars
        y[i] = 0
        if k==1 || k==nz
            for j in 1:2*nvars
                y[i] = y[i] + x[j]
            end
        else
            for j in 1:3*nvars
                y[i] = y[i] + x[j]
            end
        end
    end
end
## Create a function that computes a single row of the jacobian in-place
function eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrapper, jac_cfg)
    # we construct a view for the segment of xvalues and yvalues we need for cell k
    # we do the same for the jacobian
    ki = 0
    kf = 0
    if k==1
        ki = nvars*(k-1)+1
        kf = nvars*(k+1)
    elseif k==nz
        ki = nvars*(k-2)+1
        kf = nvars*(k)
    else
        ki = nvars*(k-2)+1
        kf = nvars*(k+1)
    end
    jac_view = view(jacobian, nvars*(k-1)+1:nvars*k, ki:kf)
    yvalues_view = view(yvalues, nvars*(k-1)+1:nvars*k)
    xvalues_view = view(xvalues, ki:kf)
    ForwardDiff.jacobian!(jac_view, equ_wrapper, yvalues_view, xvalues_view, jac_cfg)
end
## This function parallelizes the calculation of each row
function eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
    #for k in 1:nz
    Threads.@threads for k in 1:nz
        eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrappers[k], jac_cfgs[k])
    end
end
##initialize xvalues, yvalues, and compute the Jacobian
xvalues = rand(nz*nvars)
yvalues = zeros(nz*nvars)
equ_wrappers = [(y,x)->eval_cell_eqs!(y, x, k, nz, nvars) for k in 1:nz]
function xsize(k, nz, nvars)
    if k==1 || k==nz
        return 2*nvars
    else
        return 3*nvars
    end
end
jac_cfgs = [ForwardDiff.JacobianConfig(equ_wrappers[k], zeros(nvars), zeros(xsize(k,nz,nvars)), ForwardDiff.Chunk{(k==1||k==nz ? 2*nvars : 3*nvars)}()) for k in 1:nz]
eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
jacobian
# For nz=4, nvars=2, This one looks like this:
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
##
using BenchmarkTools
@benchmark eval_jacobian!($xvalues, $yvalues, $jacobian, $nz, $nvars, $equ_wrappers, $jac_cfgs)
In the equations I'm just adding up over all variables at the current and neighboring cells, so I get the block tri-diagonal structure with ones. I am fixing the chunk size (ForwardDiff.Chunk{(k==1||k==nz ? 2*nvars : 3*nvars)}()) in order to compare with a different version of this code. The resulting benchmark for eval_jacobian! gives this on my laptop:
Now, there are two problems with how things are done here:
Now, I found a way to precompute those repeated sums. It involves using the facilities provided by the PreallocationTools
package. This allows me to define preallocated objects that are turned into dual numbers, so one can perform operations without constantly reallocating. Probably a bit dangerous, but what I'm doing is directly editing these preallocated objects before the call to ForwardDiff.jacobian. That way, when I compute the Jacobian I can access these preallocated objects and they already contain the results I need, avoiding repeated computation. The full example is a bit long though, so first a minimal sketch of the basic pattern:
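This sketch just shows the standard DiffCache/get_tmp usage from PreallocationTools with made-up names; it is not part of the actual experiment. Note that in the long example below I go a step further and write directly into the cache's du and dual_du fields before calling ForwardDiff.jacobian.
using ForwardDiff, PreallocationTools
# A cache holding both a plain Float64 buffer and a Dual-compatible buffer
cache = DiffCache(zeros(3))
function g(x, cache)
    tmp = get_tmp(cache, x)  # returns the buffer matching the element type of x
    tmp .= 2 .* x            # work in the preallocated buffer, no new allocation
    return sum(abs2, tmp)
end
x0 = rand(3)
g(x0, cache)                                # uses the Float64 buffer
ForwardDiff.gradient(x -> g(x, cache), x0)  # transparently uses the Dual buffer
With that established, here is the full example: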
using BlockBandedMatrices, ForwardDiff, Base.Threads, DiffResults, BenchmarkTools, PreallocationTools
nz = 1000 # number of zones
nvars = 20 # number of variables
xvalues = rand(nz*nvars)
yvalues = zeros(nz*nvars)
## Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1 # block bandwidths
N = M = nz # number of row/column blocks
cols = rows = [nvars for i = 1:N] # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
# For nz=4, nvars=2, this looks like this
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
##store sum of all values and their partials at each cell
xmock = zeros(nvars)
sum_diff_results = [DiffResults.GradientResult(xmock) for k in 1:nz]
grad_equ_wrappers = [x->sum(x) for k in 1:nz]
gradient_cfgs = [ForwardDiff.GradientConfig(grad_equ_wrappers[k], xmock) for k in 1:nz]
diff_cache_array = [DiffCache(zeros((k==nz || k==1) ? 2 : 3),(k==nz || k==1) ? 2*nvars : 3*nvars) for k=1:nz]
##
function compute_sums_and_partials(xvalues, sum_diff_results, nz, nvars, equ_wrappers, gradient_cfgs, diff_cache_array)
    Threads.@threads for k in 1:nz
        ForwardDiff.gradient!(sum_diff_results[k], equ_wrappers[k], view(xvalues, ((k-1)*nvars+1):k*nvars), gradient_cfgs[k])
    end
    Threads.@threads for k in 2:nz-1
        diff_cache_array[k].du[1] = DiffResults.value(sum_diff_results[k-1])
        diff_cache_array[k].du[2] = DiffResults.value(sum_diff_results[k])
        diff_cache_array[k].du[3] = DiffResults.value(sum_diff_results[k+1])
        diff_cache_array[k].dual_du[1] = DiffResults.value(sum_diff_results[k-1])
        diff_cache_array[k].dual_du[2:1+nvars] = DiffResults.gradient(sum_diff_results[k-1])
        diff_cache_array[k].dual_du[2+nvars:1+3*nvars] .= 0
        diff_cache_array[k].dual_du[2+3*nvars] = DiffResults.value(sum_diff_results[k])
        diff_cache_array[k].dual_du[3+3*nvars:2+4*nvars] .= 0
        diff_cache_array[k].dual_du[3+4*nvars:2+5*nvars] = DiffResults.gradient(sum_diff_results[k])
        diff_cache_array[k].dual_du[3+5*nvars:2+6*nvars] .= 0
        diff_cache_array[k].dual_du[3+6*nvars] = DiffResults.value(sum_diff_results[k+1])
        diff_cache_array[k].dual_du[4+6*nvars:3+8*nvars] .= 0
        diff_cache_array[k].dual_du[4+8*nvars:3+9*nvars] = DiffResults.gradient(sum_diff_results[k+1])
    end
    diff_cache_array[1].du[1] = DiffResults.value(sum_diff_results[1])
    diff_cache_array[1].du[2] = DiffResults.value(sum_diff_results[2])
    diff_cache_array[1].dual_du[1] = DiffResults.value(sum_diff_results[1])
    diff_cache_array[1].dual_du[2:1+nvars] = DiffResults.gradient(sum_diff_results[1])
    diff_cache_array[1].dual_du[2+nvars:1+2*nvars] .= 0
    diff_cache_array[1].dual_du[2+2*nvars] = DiffResults.value(sum_diff_results[2])
    diff_cache_array[1].dual_du[3+2*nvars:2+3*nvars] .= 0
    diff_cache_array[1].dual_du[3+3*nvars:2+4*nvars] = DiffResults.gradient(sum_diff_results[2])
    diff_cache_array[nz].du[1] = DiffResults.value(sum_diff_results[end-1])
    diff_cache_array[nz].du[2] = DiffResults.value(sum_diff_results[end])
    diff_cache_array[nz].dual_du[1] = DiffResults.value(sum_diff_results[end-1])
    diff_cache_array[nz].dual_du[2:1+nvars] = DiffResults.gradient(sum_diff_results[end-1])
    diff_cache_array[nz].dual_du[2+nvars:1+2*nvars] .= 0
    diff_cache_array[nz].dual_du[2+2*nvars] = DiffResults.value(sum_diff_results[end])
    diff_cache_array[nz].dual_du[3+2*nvars:2+3*nvars] .= 0
    diff_cache_array[nz].dual_du[3+3*nvars:2+4*nvars] = DiffResults.gradient(sum_diff_results[end])
end
@benchmark compute_sums_and_partials($xvalues, $sum_diff_results, $nz, $nvars, grad_equ_wrappers, $gradient_cfgs, diff_cache_array)
## Make a mock set of equations to be solved for each row
function eval_cell_eqs!(y, x, k, nz, nvars, diff_cache_array)
    # k is the cell for which I am computing the equations
    # y = [y_1,y_2] are the discretized equations I want to solve.
    # x are the independent variables in the current and neighboring cells,
    # for example for k=2 I would have [T1,P1,T2,P2,T3,P3]. For edge cases
    # I only have the relevant subset of variables, eg for k=1, x=[T1,P1,T2,P2]
    sums = get_tmp(diff_cache_array[k], y[1])
    for i in 1:nvars
        if k==1 || k==nz
            y[i] = sums[1]+sums[2]
        else
            y[i] = sums[1]+sums[2]+sums[3]
        end
    end
end
eval_cell_eqs!(zeros(nvars), rand(3*nvars), 2, nz, nvars, diff_cache_array)
## Create a function that computes a single row of the jacobian in-place
function eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrapper, jac_cfg)
    # we construct a view for the segment of xvalues and yvalues we need for cell k
    # we do the same for the jacobian
    ki = 0
    kf = 0
    if k==1
        ki = nvars*(k-1)+1
        kf = nvars*(k+1)
    elseif k==nz
        ki = nvars*(k-2)+1
        kf = nvars*(k)
    else
        ki = nvars*(k-2)+1
        kf = nvars*(k+1)
    end
    jac_view = view(jacobian, nvars*(k-1)+1:nvars*k, ki:kf)
    yvalues_view = view(yvalues, nvars*(k-1)+1:nvars*k)
    xvalues_view = view(xvalues, ki:kf)
    ForwardDiff.jacobian!(jac_view, equ_wrapper, yvalues_view, xvalues_view, jac_cfg)
end
## This function parallelizes the calculation of each row
function eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
    #for k in 1:nz
    Threads.@threads for k in 1:nz
        eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrappers[k], jac_cfgs[k])
    end
end
##compute the Jacobian
equ_wrappers = [(y,x)->eval_cell_eqs!(y, x, k, nz, nvars, diff_cache_array) for k in 1:nz]
function xsize(k, nz, nvars)
    if k==1 || k==nz
        return 2*nvars
    else
        return 3*nvars
    end
end
jac_cfgs = [ForwardDiff.JacobianConfig(equ_wrappers[k], zeros(nvars), zeros(xsize(k,nz,nvars)), ForwardDiff.Chunk{(k==1||k==nz ? 2*nvars : 3*nvars)}()) for k in 1:nz]
eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
jacobian
# For nz=4, nvars=2, This one looks like this:
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
##
using BenchmarkTools
@benchmark begin
compute_sums_and_partials($xvalues, $sum_diff_results, $nz, $nvars, grad_equ_wrappers, $gradient_cfgs, $diff_cache_array)
eval_jacobian!($xvalues, $yvalues, $jacobian, $nz, $nvars, $equ_wrappers, $jac_cfgs)
end
And here's the benchmark for this one:
This one has two issues:
So in the end this only provides a small optimization. I feel there's still a steep price being paid simply in allocation costs... But this does provide a way to precompute values.
And one other example, using SparseDiffTools. This also produces a block tridiagonal jacobian filled with ones.
using BlockBandedMatrices, ForwardDiff, Base.Threads
fcalls = 0
function f(y, x, nz, nvars) # in-place
    global fcalls += 1
    for i in 1:nz
        for l in 1:nvars
            index = (i-1)*nvars
            y[index+l] = 0
            if i==1
                for j in 1:2*nvars
                    y[index+l] = y[index+l] + x[j]
                end
            elseif i==nz
                for j in 0:2*nvars-1
                    y[index+l] = y[index+l] + x[end-j]
                end
            else
                for j in (i-2)*nvars+1:(i+1)*nvars
                    y[index+l] = y[index+l] + x[j]
                end
            end
        end
    end
    nothing
end
## construct sparse jacobian matrix
#using Symbolics
nz = 1000 # number of zones
nvars = 20 # number of variables
input = rand(nz*nvars)
output = similar(input)
#sparsity_pattern = Symbolics.jacobian_sparsity((y,x)->f(y,x,nz,nvars),output,input)
#jac = Float64.(sparsity_pattern)
# Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1 # block bandwidths
N = M = nz # number of row/column blocks
cols = rows = [nvars for i = 1:N] # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
## get colorvec of jac
using SparseDiffTools
colors = matrix_colors(jacobian)
## compute jacobian
equ = (y,x)->f(y,x,nz,nvars)
using BenchmarkTools
@benchmark forwarddiff_color_jacobian!($jacobian, $equ, $input, colorvec = $colors)
#forwarddiff_color_jacobian!(jacobian, equ, input, colorvec = colors)
This is slower and more allocation-heavy. But it is not parallelized (the examples in the comment above use 8 threads).
This has become a bit too general; better to make sub-issues.
Although we have a very nice, flexible system to solve equations right now, I think it is pretty wasteful of resources (and thus not as fast as I'd like). In NuclearBurning.jl the computation of each row of the Jacobian requires about 40 kB of allocations. That is too high for my taste and a potential source of significant slowdown, possibly more so than the repeated calculation of EOS/kap (or it could be the repeated allocation of the EOS and kap results). I'm not sure how to preallocate these things in a clean way though, or whether they're even the cause of the problem. If you want to spend a lot of time on this, Julia's profiling tools are excellent; I summarize how to use them below. In NuclearBurning.jl you can try adding a bit of code as follows:
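(The exact snippet isn't reproduced here; the following is only a rough sketch. The profiled call do_evolution_loop!(sm) is a placeholder for whatever entry point in NuclearBurning.jl you want to analyze, and PProf.jl is assumed for viewing the allocation profile.)
using Profile
# Runtime profile: in the VS Code REPL, @profview (provided by the Julia extension)
# opens an interactive flame graph of where time is spent.
@profview do_evolution_loop!(sm)   # placeholder call
# Allocation profile: sample allocations while running the same call...
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.1 do_evolution_loop!(sm)   # placeholder call
# ...and inspect the result in the browser with PProf
using PProf
PProf.Allocs.pprof()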
In VS Code, @profview will give you what is called a flame graph, which shows the runtime spent in different parts of the code. You might see a lot of cruft associated with the profiling itself, but you can filter that out by clicking on the parts of the graph relevant to the code.
Profile.Allocs.@profile, on the other hand, is used to analyze where allocations are made. This can be a bit trickier to read: the output is accessed through the browser, and you can also get a flame graph (check the "view" menu; you can also check source-code lines under "source" and filter by total allocated space rather than allocation count under "sample").
Regarding pre-allocating space for computed quantities, it can be tricky. The calculation of the Jacobian operates on chunks (it does not do all derivatives at once) to save a bit of memory. If we pre-compute everything, we could only do so for one chunk, so we would need to force the chunk size to include all variables (using JacobianConfig). It could be that memory-wise that is not so awful. Even then it's still tricky to set up the pre-allocation with the right types, but I think it can be done after playing with it a bit. Nevertheless, I'm still not sure the main bottleneck is the allocation of all these quantities.
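For concreteness, forcing the chunk to span all inputs of one row is the same pattern already used in the mock examples above; a stripped-down sketch with a stand-in function (not the real cell equations):
using ForwardDiff
nvars = 20
nx = 3*nvars                       # variables of cell k plus both neighbors
f! = (y, x) -> (y .= sum(x))       # stand-in for the real cell equations
ycache = zeros(nvars)
xcache = zeros(nx)
# One chunk covering every input, so all partials are carried in a single pass
cfg = ForwardDiff.JacobianConfig(f!, ycache, xcache, ForwardDiff.Chunk{nx}())
J = zeros(nvars, nx)
ForwardDiff.jacobian!(J, f!, ycache, rand(nx), cfg)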
Another potential source of problems is how the Jacobian is stored. We are currently using a SparseMatrix, which knows absolutely nothing about the tri-diagonal structure (it just stores all values and their locations). I mostly ended up using it because LinearSolvers can take it. The other option is using a BlockBandedMatrix, which encodes the full tri-diagonal structure. BlockBandedMatrix is actually used right now in the StellarModel constructor, but only to set up the non-zero components of the Jacobian in the sparse matrix. One could instead just keep the BlockBandedMatrix around by defining StellarModel as:
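(The exact definition isn't reproduced here; this is a rough sketch where the struct contents, apart from the jacobian field being a BlockBandedMatrix, are placeholders and not the real StellarModel fields.)
using BlockBandedMatrices
struct StellarModel
    nz::Int
    nvars::Int
    jacobian::BlockBandedMatrix{Float64}   # keep the block-banded matrix itself
    # ... all other existing fields stay as they are ...
end
function StellarModel(nz::Int, nvars::Int)
    l, u = 1, 1
    rows = cols = fill(nvars, nz)
    jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
    return StellarModel(nz, nvars, jacobian)
end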
This will break the solver, as BlockBandedMatrix is not compatible with LinearSolvers, but it can be used to see how this impacts performance. I thought this would make a difference, but at least in my tests it seems evaluating a row of the jacobian takes pretty much the same amount of time whether we use a SparseArray or a BlockBandedMatrix.
A final (more dramatic) alternative to how we do things is to forgo the full calculation of the Jacobian and the standard Newton solver. This is likely to be far more memory efficient for very large numbers of variables (i.e. networks). The idea is to construct Jacobian-vector product operators (see SparseDiffTools.jl), which are then used by iterative solvers (where an iteration here means an iteration to solve the linearized system, not the full non-linear system). There is a whole family of methods devised to solve linear Ax=b problems without explicitly computing the entire matrix (Krylov methods, giving Newton-Krylov solvers when wrapped in the outer Newton iteration), and there is a package providing them (IterativeSolvers.jl) which I think can work directly with the Jacobian-vector product functions produced by SparseDiffTools.jl.
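As a rough illustration of the matrix-free idea (not Jems code), one can build a Jacobian-vector product with SparseDiffTools and hand it to gmres from IterativeSolvers. The residual function here is a toy stand-in, and I'm pulling in LinearMaps.jl purely to wrap the product into an operator with size/eltype/mul! defined:
using SparseDiffTools, IterativeSolvers, LinearMaps
n = 100
f!(y, x) = (y .= x.^2 .- 1.0)   # toy in-place residual, J(x) = Diagonal(2x)
x0 = 1.0 .+ rand(n)             # current guess (kept away from zero)
b = rand(n)                     # right-hand side of the linearized system J(x0)*dx = b
# J(x0)*v computed with forward-mode AD, without ever building J(x0)
jv!(dy, v) = auto_jacvec!(dy, f!, x0, v)
Jop = LinearMap{Float64}(jv!, n; ismutating=true)
dx = gmres(Jop, b)
maximum(abs.(2 .* x0 .* dx .- b))   # residual of the linear solve, should be tiny
The appeal is that memory usage stays at a few vectors of length n, instead of the full (even if sparse) Jacobian, at the cost of extra function evaluations inside the Krylov iterations.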