Performance of tall Jacobians

MaximilianJHuber commented 4 years ago

Let H be a matrix, x a vector, and f! a matrix-valued function. To make this fit the interface, I allow H to be a vector and reshape H :

using SparseArrays, SparseDiffTools, BenchmarkTools, ProfileView
function f!(H, x)
    if length(size(H)) == 1
        H = reshape(H, 500000, 4)
    end
    M = size(H)[1]
    N = size(H)[2]

    for m in 1:M
        for n in 1:N
            if isodd(n)
                H[m, n] = 2*x[1]
            else
                H[m, n] = 3*x[2]
            end
        end
    end
end
H = zeros(500000, 4)
x = 5.0 * ones(200)
@views f!(H[:], x) # same as f!(H, x)

Of the 200 variables only the first two are relevant, but even they collapse to one color.

pattern = sparse([1, 3, 2, 4], [1, 1, 2, 2], 1, size(H, 2), length(x))
jac = Float64.(sparse(pattern))
colors = matrix_colors(jac)
pattern = kron(pattern, ones(size(H, 1)))
jac = Float64.(sparse(pattern))
cache = ForwardColorJacCache(f!, x, dx=H, colorvec=colors ,sparsity=pattern);
@btime forwarddiff_color_jacobian!(jac, f!, x, cache);

108.402 ms (4 allocations: 160 bytes)

It is still much faster than full ForwardDiff:

full_jac = Matrix(jac);
full_jac .= 0;
@time ForwardDiff.jacobian!(full_jac, f!, H, x);

2.270 s (6 allocations: 198.39 MiB)

but @profview forwarddiff_color_jacobian!(jac, f!, x, cache); shows that most time is spent in FiniteDiff._colorediteration!.

YingboMa commented 4 years ago

Here are some thoughts on optimizing FiniteDiff._colorediteration!. Currently, our iteration order is to linearly traverse the compressed Jacobian and to find the appropriate column in the uncompressed Jacobian to copy to. This could bad for cache locality because when the matrix is large and sparse, we want to minimize the number of times that we traverse the large uncompressed Jacobian. So, I suggest to linearly traverse the uncompressed Jacobian and to find the appropriate in compressed Jacobian to copy from.

MaximilianJHuber commented 4 years ago

But isn't traversing the compressed Jacobian slower than traversing the uncompressed?

I have no sense how cache locality works for sparse matrices, but to find an element in the sparse matrix a heuristic may be good before starting to traverse the rowval: the next element in the compressed Jacobian is likely to be the next element in the uncompressed, at least for the sparsity pattern I have.

MaximilianJHuber commented 4 years ago

I read a comment from @j-fu on this post. His ExtendableSparse.jl and the way he uses it to extract the Jacobian out of ForwardDiff into a sparse matrix may be relevant for this issue.

sjdaines commented 3 years ago

Partial fix in PR https://github.com/JuliaDiff/SparseDiffTools.jl/pull/146

This adds a fast path that avoids slow sparse indexing if J and sparsity are both SparseMatrixCSC with the same sparsity pattern.

Updated timing (Julia 1.6.2)

SparseDiffTools.jl v1.14.0:

julia> @btime forwarddiff_color_jacobian!($jac, $f!, $x, $cache);
  107.003 ms (4 allocations: 160 bytes)

With PR https://github.com/JuliaDiff/SparseDiffTools.jl/pull/146:

julia> @btime forwarddiff_color_jacobian!($jac, $f!, $x, $cache);
  10.153 ms (4 allocations: 160 bytes)

and @profview forwarddiff_color_jacobian!(jac, f!, x, cache); shows that the specialized SparseDiffTools. _colorediteration! is now approx 1/4 the total.

JuliaDiff / SparseDiffTools.jl

Performance of tall Jacobians #100