JuliaNLSolvers / Optim.jl

Optimization functions for Julia

Output sparsity pattern of SparseMatrixCSC depends on autodiff option (finite, forward, backward/flux) #964

Open learning-chip opened 2 years ago

learning-chip commented 2 years ago

When optimizing a function with a SparseMatrixCSC input, the output ("minimizer") is still a sparse matrix type, but its sparsity pattern is inconsistent across autodiff options:

  1. With autodiff = :finite (the default), the output is a SparseMatrixCSC but with a dense pattern -- mathematically the same as a dense matrix, with no structural zeros.
  2. With autodiff = :forward, the output has the same sparsity pattern as the original input -- structural zeros stay zero. This seems like the most reasonable behavior.
  3. With Flux/Zygote via FluxOptTools, the sparsity pattern is kept (at least mathematically). The software behavior is a bit odd, though -- minimizer() returns a flattened 1-D array of the matrix values, not a SparseMatrixCSC. The gradient computation itself should be correct (ref https://github.com/FluxML/Zygote.jl/issues/163#issuecomment-987535974).

Should such inconsistent behaviors be considered "expected" or "buggy"?

To reproduce:

using Optim, SparseArrays, LinearAlgebra, Random
using Zygote, Flux, FluxOptTools

n = 8
I_n = spdiagm(ones(n))
Random.seed!(0)
A = sprand(n, n, 0.5) * 0.2 + I_n
B = sprand(n, n, 0.5) * 0.2 + I_n

f(P) = norm(A * P - I_n)  # f is zero only when P = inv(A)

B_opt_fd = Optim.minimizer(optimize(f, B, BFGS(), autodiff = :finite))   # result has a dense pattern
B_opt_ad = Optim.minimizer(optimize(f, B, BFGS(), autodiff = :forward))  # result keeps B's sparsity pattern

# the rest are specific to Flux-computed gradient

function optimize_flux(B)
    # could also compute the gradient explicitly via `f_grad(P) = gradient(f, P)[1]` and pass it as `g!` to Optim,
    # but FluxOptTools handles the array flattening for us
    B_opt = copy(B)  # avoid in-place modification to original B
    loss() = f(B_opt)
    Zygote.refresh()
    pars   = Flux.params(B_opt)
    lossfun, gradfun, fg!, p0 = optfuns(loss, pars)
    B_vals = Optim.minimizer(optimize(Optim.only_fg!(fg!), p0, BFGS()))
    return B_opt, B_vals
end

B_opt_flux, B_vals_flux = optimize_flux(B)  # B_opt_flux has same sparsity as B, but with values modified
isapprox(B_opt_flux, B_opt_ad)  # results agree
isequal(filter(!iszero, B_vals_flux), B_opt_flux.nzval)  # B_vals_flux is a flattened, dense 1-D array


longemen3000 commented 2 years ago

I think the problem is not in Optim.jl but in NLSolversBase.jl. I can reproduce it with:

using NLSolversBase, SparseArrays, Random, LinearAlgebra
n = 8
I_n = spdiagm(ones(n))
Random.seed!(0)
A = sprand(n, n, 0.5) * 0.2 + I_n
B = sprand(n, n, 0.5) * 0.2 + I_n
f(P) = norm(A * P - I_n)
FD = OnceDifferentiable(f, B, autodiff = :finite)
AD = OnceDifferentiable(f, B, autodiff = :forward)
dB_ad = copy(B)
dB_fd = copy(B)
AD.df(dB_ad, B); dB_ad  # forward-mode gradient keeps B's sparsity pattern
FD.df(dB_fd, B); dB_fd  # finite-difference gradient comes back with a dense pattern

It's almost certainly related to FiniteDiff.jl (the finite-difference backend), but I need to confirm.

longemen3000 commented 2 years ago

A reproducer without Optim.jl:

using FiniteDiff, ForwardDiff, SparseArrays, Random, LinearAlgebra
n = 8
I_n = spdiagm(ones(n))
Random.seed!(0)
A = sprand(n, n, 0.5) * 0.2 + I_n
B = sprand(n, n, 0.5) * 0.2 + I_n
f(P) = norm(A * P - I_n)
FiniteDiff.finite_difference_gradient(f, B)  # dense pattern
ForwardDiff.gradient(f, B)                   # keeps B's sparsity pattern

NLSolversBase (and Optim) pins FiniteDiff to version 2.0, but that's not relevant here (the error also occurs on 2.8.1).

pkofod commented 2 years ago

Interesting. I think you'd have to use SparseDiffTools.jl instead of FiniteDiff.jl to preserve sparsity patterns. @ChrisRackauckas?

ChrisRackauckas commented 2 years ago

FiniteDiff.jl has the same sparsity API as SparseDiffTools.jl. Just pass the color vector and the sparse Jacobian. SparseDiffTools has the AD version of sparse differentiation, along with the graph algorithms for computing the color vectors, while FiniteDiff.jl has a compatible finite difference Jacobian implementation.
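
For reference, a minimal sketch of that sparsity API on a toy vector-valued function with a known tridiagonal Jacobian (not the gradient case from this issue). The colorvec/sparsity keyword arguments reflect my reading of FiniteDiff's Jacobian interface and should be checked against the current docs:

using FiniteDiff, SparseDiffTools, SparseArrays

# in-place, vector-valued function whose Jacobian is tridiagonal
function g!(du, u)
    n = length(u)
    du[1] = -2u[1] + u[2]
    for i in 2:n-1
        du[i] = u[i-1] - 2u[i] + u[i+1]
    end
    du[n] = u[n-1] - 2u[n]
    return du
end

n = 8
u = rand(n)
sparsity = spdiagm(-1 => ones(n-1), 0 => ones(n), 1 => ones(n-1))  # declared Jacobian pattern
colors = matrix_colors(sparsity)  # graph coloring from SparseDiffTools
J = copy(sparsity)                # preallocated sparse output with the declared pattern
FiniteDiff.finite_difference_jacobian!(J, g!, u, colorvec = colors, sparsity = sparsity)
J  # stays tridiagonal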

ChrisRackauckas commented 2 years ago

Reading this more closely, @longemen3000 did not pass the sparsity to FiniteDiff, so it did not know that the Jacobian it was calculating was supposed to be sparse. Having a sparse input does not necessarily mean a sparse output (for example, a sparse initial condition to an ODE or a neural network will become dense), so a sparse input alone is not sufficient information for any differentiation code to determine what sparsity pattern the derivative would have. And we haven't implemented sparsity on gradients, since that's a rather niche case: at most it would be a sparse vector, compared to sparse Jacobians or Hessians, which are matrices.
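
To illustrate that point with the objective from this issue (a sketch using the analytic gradient rather than any AD backend; norm here is the Frobenius norm, which is what LinearAlgebra.norm computes for a matrix): the true gradient of f(P) = norm(A * P - I_n) is A' * (A * P - I_n) / norm(A * P - I_n), whose pattern is generally much denser than B's, so B's pattern alone cannot tell a differentiation routine what pattern to produce.

using SparseArrays, LinearAlgebra, Random

n = 8
I_n = spdiagm(ones(n))
Random.seed!(0)
A = sprand(n, n, 0.5) * 0.2 + I_n
B = sprand(n, n, 0.5) * 0.2 + I_n

# analytic gradient of f(P) = norm(A * P - I_n) with respect to P
grad_f(P) = A' * (A * P - I_n) / norm(A * P - I_n)

nnz(B), nnz(grad_f(B))  # the exact gradient typically has many more stored entries than B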