Just want to summarize some random experiments I've been trying out. For simplicity, I'm doing this completely independently of the code in Jems, using a simpler system that produces a block tri-diagonal matrix filled with ones. The code below shows an example of this, following a structure similar to what we do in Solver.jl.
using BlockBandedMatrices, ForwardDiff, Base.Threads
nz = 1000 # number of zones
nvars = 20 # number of variables
## Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1 # block bandwidths
N = M = nz # number of row/column blocks
cols = rows = [nvars for i = 1:N] # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
# For nz=4, nvars=2, this looks like this
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
## Make a mock set of equations to be solved for each row
function eval_cell_eqs!(y, x, k, nz, nvars)
    # k is the cell for which I am computing the equations
    # y = [y_1,y_2] are the discretized equations I want to solve.
    # x are the independent variables in the current and neighboring cells,
    # for example for k=2 I would have [T1,P1,T2,P2,T3,P3]. For edge cases
    # I only have the relevant subset of variables, eg for k=1, x=[T1,P1,T2,P2]
    for i in 1:nvars
        y[i] = 0
        if k==1 || k==nz
            for j in 1:2*nvars
                y[i] = y[i] + x[j]
            end
        else
            for j in 1:3*nvars
                y[i] = y[i] + x[j]
            end
        end
    end
end
## Create a function that computes a single row of the jacobian in-place
function eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrapper, jac_cfg)
    # we construct a view for the segment of xvalues and yvalues we need for cell k
    # we do the same for the jacobian
    ki = 0
    kf = 0
    if k==1
        ki = nvars*(k-1)+1
        kf = nvars*(k+1)
    elseif k==nz
        ki = nvars*(k-2)+1
        kf = nvars*(k)
    else
        ki = nvars*(k-2)+1
        kf = nvars*(k+1)
    end
    jac_view = view(jacobian, nvars*(k-1)+1:nvars*k, ki:kf)
    yvalues_view = view(yvalues, nvars*(k-1)+1:nvars*k)
    xvalues_view = view(xvalues, ki:kf)
    ForwardDiff.jacobian!(jac_view, equ_wrapper, yvalues_view, xvalues_view, jac_cfg)
end
## This function parallelizes the calculation of each row
function eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
    #for k in 1:nz
    Threads.@threads for k in 1:nz
        eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrappers[k], jac_cfgs[k])
    end
end
##initialize xvalues, yvalues, and compute the Jacobian
xvalues = rand(nz*nvars)
yvalues = zeros(nz*nvars)
equ_wrappers = [(y,x)->eval_cell_eqs!(y, x, k, nz, nvars) for k in 1:nz]
function xsize(k, nz, nvars)
    if k==1 || k==nz
        return 2*nvars
    else
        return 3*nvars
    end
end
jac_cfgs = [ForwardDiff.JacobianConfig(equ_wrappers[k], zeros(nvars), zeros(xsize(k,nz,nvars)), ForwardDiff.Chunk{(k==1||k==nz ? 2*nvars : 3*nvars)}()) for k in 1:nz]
eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
jacobian
# For nz=4, nvars=2, This one looks like this:
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
##
using BenchmarkTools
@benchmark eval_jacobian!($xvalues, $yvalues, $jacobian, $nz, $nvars, $equ_wrappers, $jac_cfgs)
In the equations I'm just adding up over all variables at the current and neighboring cells, so I get the block tri-diagonal structure with ones. I am fixing the chunk size (ForwardDiff.Chunk{(k==1||k==nz ? 2*nvars : 3*nvars)}()) in order to compare with a different version of this code. The resulting benchmark for eval_jacobian! gives this on my laptop:
Now, there are two problems with how things are done here:
Now, I found a way to precompute those repeated sums. It involves using the facilities provided by the PreallocationTools
package. This allows me to define preallocated objects that are turned into dual numbers, so one can perform operations without constantly reallocating. Probably a bit dangerous, but what I'm doing is directly editing these preallocated objects before the call to ForwardDiff.jacobian. That way, when I compute the Jacobian I can access these preallocated objects and they already contain the results I need, avoiding repeated computation. The full example is a bit long though, so first a minimal sketch of the basic pattern:
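This sketch just shows the standard DiffCache/get_tmp usage from PreallocationTools with made-up names; it is not part of the actual experiment. Note that in the long example below I go a step further and write directly into the cache's du and dual_du fields before calling ForwardDiff.jacobian.
using ForwardDiff, PreallocationTools
# A cache holding both a plain Float64 buffer and a Dual-compatible buffer
cache = DiffCache(zeros(3))
function g(x, cache)
    tmp = get_tmp(cache, x)  # returns the buffer matching the element type of x
    tmp .= 2 .* x            # work in the preallocated buffer, no new allocation
    return sum(abs2, tmp)
end
x0 = rand(3)
g(x0, cache)                                # uses the Float64 buffer
ForwardDiff.gradient(x -> g(x, cache), x0)  # transparently uses the Dual buffer
With that established, here is the full example: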
using BlockBandedMatrices, ForwardDiff, Base.Threads, DiffResults, BenchmarkTools, PreallocationTools
nz = 1000 # number of zones
nvars = 20 # number of variables
xvalues = rand(nz*nvars)
yvalues = zeros(nz*nvars)
## Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1 # block bandwidths
N = M = nz # number of row/column blocks
cols = rows = [nvars for i = 1:N] # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
# For nz=4, nvars=2, this looks like this
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0 │ 0.0 0.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
# ⋅ ⋅ │ ⋅ ⋅ │ 0.0 0.0 │ 0.0 0.0
##store sum of all values and their partials at each cell
xmock = zeros(nvars)
sum_diff_results = [DiffResults.GradientResult(xmock) for k in 1:nz]
grad_equ_wrappers = [x->sum(x) for k in 1:nz]
gradient_cfgs = [ForwardDiff.GradientConfig(grad_equ_wrappers[k], xmock) for k in 1:nz]
diff_cache_array = [DiffCache(zeros((k==nz || k==1) ? 2 : 3),(k==nz || k==1) ? 2*nvars : 3*nvars) for k=1:nz]
##
function compute_sums_and_partials(xvalues, sum_diff_results, nz, nvars, equ_wrappers, gradient_cfgs, diff_cache_array)
    Threads.@threads for k in 1:nz
        ForwardDiff.gradient!(sum_diff_results[k], equ_wrappers[k], view(xvalues, ((k-1)*nvars+1):k*nvars), gradient_cfgs[k])
    end
    Threads.@threads for k in 2:nz-1
        diff_cache_array[k].du[1] = DiffResults.value(sum_diff_results[k-1])
        diff_cache_array[k].du[2] = DiffResults.value(sum_diff_results[k])
        diff_cache_array[k].du[3] = DiffResults.value(sum_diff_results[k+1])
        diff_cache_array[k].dual_du[1] = DiffResults.value(sum_diff_results[k-1])
        diff_cache_array[k].dual_du[2:1+nvars] = DiffResults.gradient(sum_diff_results[k-1])
        diff_cache_array[k].dual_du[2+nvars:1+3*nvars] .= 0
        diff_cache_array[k].dual_du[2+3*nvars] = DiffResults.value(sum_diff_results[k])
        diff_cache_array[k].dual_du[3+3*nvars:2+4*nvars] .= 0
        diff_cache_array[k].dual_du[3+4*nvars:2+5*nvars] = DiffResults.gradient(sum_diff_results[k])
        diff_cache_array[k].dual_du[3+5*nvars:2+6*nvars] .= 0
        diff_cache_array[k].dual_du[3+6*nvars] = DiffResults.value(sum_diff_results[k+1])
        diff_cache_array[k].dual_du[4+6*nvars:3+8*nvars] .= 0
        diff_cache_array[k].dual_du[4+8*nvars:3+9*nvars] = DiffResults.gradient(sum_diff_results[k+1])
    end
    diff_cache_array[1].du[1] = DiffResults.value(sum_diff_results[1])
    diff_cache_array[1].du[2] = DiffResults.value(sum_diff_results[2])
    diff_cache_array[1].dual_du[1] = DiffResults.value(sum_diff_results[1])
    diff_cache_array[1].dual_du[2:1+nvars] = DiffResults.gradient(sum_diff_results[1])
    diff_cache_array[1].dual_du[2+nvars:1+2*nvars] .= 0
    diff_cache_array[1].dual_du[2+2*nvars] = DiffResults.value(sum_diff_results[2])
    diff_cache_array[1].dual_du[3+2*nvars:2+3*nvars] .= 0
    diff_cache_array[1].dual_du[3+3*nvars:2+4*nvars] = DiffResults.gradient(sum_diff_results[2])
    diff_cache_array[nz].du[1] = DiffResults.value(sum_diff_results[end-1])
    diff_cache_array[nz].du[2] = DiffResults.value(sum_diff_results[end])
    diff_cache_array[nz].dual_du[1] = DiffResults.value(sum_diff_results[end-1])
    diff_cache_array[nz].dual_du[2:1+nvars] = DiffResults.gradient(sum_diff_results[end-1])
    diff_cache_array[nz].dual_du[2+nvars:1+2*nvars] .= 0
    diff_cache_array[nz].dual_du[2+2*nvars] = DiffResults.value(sum_diff_results[end])
    diff_cache_array[nz].dual_du[3+2*nvars:2+3*nvars] .= 0
    diff_cache_array[nz].dual_du[3+3*nvars:2+4*nvars] = DiffResults.gradient(sum_diff_results[end])
end
@benchmark compute_sums_and_partials($xvalues, $sum_diff_results, $nz, $nvars, grad_equ_wrappers, $gradient_cfgs, diff_cache_array)
## Make a mock set of equations to be solved for each row
function eval_cell_eqs!(y, x, k, nz, nvars, diff_cache_array)
    # k is the cell for which I am computing the equations
    # y = [y_1,y_2] are the discretized equations I want to solve.
    # x are the independent variables in the current and neighboring cells,
    # for example for k=2 I would have [T1,P1,T2,P2,T3,P3]. For edge cases
    # I only have the relevant subset of variables, eg for k=1, x=[T1,P1,T2,P2]
    sums = get_tmp(diff_cache_array[k], y[1])
    for i in 1:nvars
        if k==1 || k==nz
            y[i] = sums[1]+sums[2]
        else
            y[i] = sums[1]+sums[2]+sums[3]
        end
    end
end
eval_cell_eqs!(zeros(nvars), rand(3*nvars), 2, nz, nvars, diff_cache_array)
## Create a function that computes a single row of the jacobian in-place
function eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrapper, jac_cfg)
    # we construct a view for the segment of xvalues and yvalues we need for cell k
    # we do the same for the jacobian
    ki = 0
    kf = 0
    if k==1
        ki = nvars*(k-1)+1
        kf = nvars*(k+1)
    elseif k==nz
        ki = nvars*(k-2)+1
        kf = nvars*(k)
    else
        ki = nvars*(k-2)+1
        kf = nvars*(k+1)
    end
    jac_view = view(jacobian, nvars*(k-1)+1:nvars*k, ki:kf)
    yvalues_view = view(yvalues, nvars*(k-1)+1:nvars*k)
    xvalues_view = view(xvalues, ki:kf)
    ForwardDiff.jacobian!(jac_view, equ_wrapper, yvalues_view, xvalues_view, jac_cfg)
end
## This function parallelizes the calculation of each row
function eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
    #for k in 1:nz
    Threads.@threads for k in 1:nz
        eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars, equ_wrappers[k], jac_cfgs[k])
    end
end
##compute the Jacobian
equ_wrappers = [(y,x)->eval_cell_eqs!(y, x, k, nz, nvars, diff_cache_array) for k in 1:nz]
function xsize(k, nz, nvars)
    if k==1 || k==nz
        return 2*nvars
    else
        return 3*nvars
    end
end
jac_cfgs = [ForwardDiff.JacobianConfig(equ_wrappers[k], zeros(nvars), zeros(xsize(k,nz,nvars)), ForwardDiff.Chunk{(k==1||k==nz ? 2*nvars : 3*nvars)}()) for k in 1:nz]
eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars, equ_wrappers, jac_cfgs)
jacobian
# For nz=4, nvars=2, This one looks like this:
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅ │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0 │ ⋅ ⋅
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0 │ 1.0 1.0
# ──────────┼────────────┼────────────┼──────────
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
# ⋅ ⋅ │ ⋅ ⋅ │ 1.0 1.0 │ 1.0 1.0
##
using BenchmarkTools
@benchmark begin
compute_sums_and_partials($xvalues, $sum_diff_results, $nz, $nvars, grad_equ_wrappers, $gradient_cfgs, $diff_cache_array)
eval_jacobian!($xvalues, $yvalues, $jacobian, $nz, $nvars, $equ_wrappers, $jac_cfgs)
end
And here's the benchmark for this one:
This one has two issues:
So in the end this only provides a small optimization. I feel there's still a steep price being paid simply in allocation costs... But this does provide a way to precompute values.
And one other example, using SparseDiffTools. This also produces a block tridiagonal jacobian filled with ones.
using BlockBandedMatrices, ForwardDiff, Base.Threads
fcalls = 0
function f(y, x, nz, nvars) # in-place
    global fcalls += 1
    for i in 1:nz
        for l in 1:nvars
            index = (i-1)*nvars
            y[index+l] = 0
            if i==1
                for j in 1:2*nvars
                    y[index+l] = y[index+l] + x[j]
                end
            elseif i==nz
                for j in 0:2*nvars-1
                    y[index+l] = y[index+l] + x[end-j]
                end
            else
                for j in (i-2)*nvars+1:(i+1)*nvars
                    y[index+l] = y[index+l] + x[j]
                end
            end
        end
    end
    nothing
end
## construct sparse jacobian matrix
#using Symbolics
nz = 1000 # number of zones
nvars = 20 # number of variables
input = rand(nz*nvars)
output = similar(input)
#sparsity_pattern = Symbolics.jacobian_sparsity((y,x)->f(y,x,nz,nvars),output,input)
#jac = Float64.(sparsity_pattern)
# Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1 # block bandwidths
N = M = nz # number of row/column blocks
cols = rows = [nvars for i = 1:N] # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
## get colorvec of jac
using SparseDiffTools
colors = matrix_colors(jacobian)
## compute jacobian
equ = (y,x)->f(y,x,nz,nvars)
using BenchmarkTools
@benchmark forwarddiff_color_jacobian!($jacobian, $equ, $input, colorvec = $colors)
#forwarddiff_color_jacobian!(jacobian, equ, input, colorvec = colors)
This is slower and more allocation-heavy. But it is not parallelized (the examples in the comment above use 8 threads).
This has become a bit too general; better to make sub-issues.
Although we have a very nice, flexible system to solve equations right now, I think it is pretty wasteful of resources (and thus not as fast as I'd like). In NuclearBurning.jl the computation of each row of the Jacobian requires about 40 kB of allocations. That is too high for my taste and a potential source of significant slowdown, possibly more so than the repeated calculation of EOS/kap (or it could be the repeated allocation of the EOS and kap results). I'm not sure how to preallocate these things in a clean way though, or whether they're even the cause of the problem. If you want to spend a lot of time on this, Julia's profiling tools are excellent; I summarize how to use them below. In NuclearBurning.jl you can try adding a bit of code as follows:
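(The exact snippet isn't reproduced here; the following is only a rough sketch. The profiled call do_evolution_loop!(sm) is a placeholder for whatever entry point in NuclearBurning.jl you want to analyze, and PProf.jl is assumed for viewing the allocation profile.)
using Profile
# Runtime profile: in the VS Code REPL, @profview (provided by the Julia extension)
# opens an interactive flame graph of where time is spent.
@profview do_evolution_loop!(sm)   # placeholder call
# Allocation profile: sample allocations while running the same call...
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.1 do_evolution_loop!(sm)   # placeholder call
# ...and inspect the result in the browser with PProf
using PProf
PProf.Allocs.pprof()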
In VS Code, @profview will give you what is called a flame graph, which shows the runtime spent in different parts of the code. You might see a lot of cruft associated with the profiling itself, but you can filter that out by clicking on the parts of the graph relevant to the code.
Profile.Allocs.@profile, on the other hand, is used to analyze where allocations are made. This can be a bit trickier to read: the output is accessed through the browser, and you can also get a flame graph (check the "view" menu; you can also check source-code lines under "source" and filter by total allocated space rather than allocation count under "sample").
Regarding pre-allocating space for computed quantities, it can be tricky. The calculation of the Jacobian operates on chunks (it does not do all derivatives at once) to save a bit of memory. If we pre-compute everything, we could only do so for one chunk, so we would need to force the chunk size to include all variables (using JacobianConfig). It could be that memory-wise that is not so awful. Even then it's still tricky to set up the pre-allocation with the right types, but I think it can be done after playing with it a bit. Nevertheless, I'm still not sure the main bottleneck is the allocation of all these quantities.
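For concreteness, forcing the chunk to span all inputs of one row is the same pattern already used in the mock examples above; a stripped-down sketch with a stand-in function (not the real cell equations):
using ForwardDiff
nvars = 20
nx = 3*nvars                       # variables of cell k plus both neighbors
f! = (y, x) -> (y .= sum(x))       # stand-in for the real cell equations
ycache = zeros(nvars)
xcache = zeros(nx)
# One chunk covering every input, so all partials are carried in a single pass
cfg = ForwardDiff.JacobianConfig(f!, ycache, xcache, ForwardDiff.Chunk{nx}())
J = zeros(nvars, nx)
ForwardDiff.jacobian!(J, f!, ycache, rand(nx), cfg)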
Another potential source of problems is how the Jacobian is stored. We are currently using a SparseMatrix, which knows absolutely nothing about the tri-diagonal structure (it just stores all values and their locations). I mostly ended up using it because LinearSolvers can take it. The other option is using a BlockBandedMatrix, which encodes the full tri-diagonal structure. BlockBandedMatrix is actually used right now in the StellarModel constructor, but only to set up the non-zero components of the Jacobian in the sparse matrix. One could instead just keep the BlockBandedMatrix around by defining StellarModel as:
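(The exact definition isn't reproduced here; this is a rough sketch where the struct contents, apart from the jacobian field being a BlockBandedMatrix, are placeholders and not the real StellarModel fields.)
using BlockBandedMatrices
struct StellarModel
    nz::Int
    nvars::Int
    jacobian::BlockBandedMatrix{Float64}   # keep the block-banded matrix itself
    # ... all other existing fields stay as they are ...
end
function StellarModel(nz::Int, nvars::Int)
    l, u = 1, 1
    rows = cols = fill(nvars, nz)
    jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
    return StellarModel(nz, nvars, jacobian)
end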
This will break the solver, as BlockBandedMatrix is not compatible with LinearSolvers, but it can be used to see how this impacts performance. I thought this would make a difference, but at least in my tests it seems evaluating a row of the jacobian takes pretty much the same amount of time whether we use a SparseArray or a BlockBandedMatrix.
A final (more dramatic) alternative to how we do things is to forgo the full calculation of the Jacobian and the standard Newton solver. This is likely to be far more memory efficient for very large numbers of variables (i.e. networks). The idea is to construct Jacobian-vector product operators (see SparseDiffTools.jl), which are then used by iterative solvers (where an iteration here means an iteration to solve the linearized system, not the full non-linear system). There is a whole family of methods devised to solve linear Ax=b problems without explicitly computing the entire matrix (Krylov methods, giving Newton-Krylov solvers when wrapped in the outer Newton iteration), and there is a package providing them (IterativeSolvers.jl) which I think can work directly with the Jacobian-vector product functions produced by SparseDiffTools.jl.
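As a rough illustration of the matrix-free idea (not Jems code), one can build a Jacobian-vector product with SparseDiffTools and hand it to gmres from IterativeSolvers. The residual function here is a toy stand-in, and I'm pulling in LinearMaps.jl purely to wrap the product into an operator with size/eltype/mul! defined:
using SparseDiffTools, IterativeSolvers, LinearMaps
n = 100
f!(y, x) = (y .= x.^2 .- 1.0)   # toy in-place residual, J(x) = Diagonal(2x)
x0 = 1.0 .+ rand(n)             # current guess (kept away from zero)
b = rand(n)                     # right-hand side of the linearized system J(x0)*dx = b
# J(x0)*v computed with forward-mode AD, without ever building J(x0)
jv!(dy, v) = auto_jacvec!(dy, f!, x0, v)
Jop = LinearMap{Float64}(jv!, n; ismutating=true)
dx = gmres(Jop, b)
maximum(abs.(2 .* x0 .* dx .- b))   # residual of the linear solve, should be tiny
The appeal is that memory usage stays at a few vectors of length n, instead of the full (even if sparse) Jacobian, at the cost of extra function evaluations inside the Krylov iterations.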