JuliaSmoothOptimizers / ADNLPModels.jl


Useful packages for (sparse) differentiation? #229

gdalle opened this issue 1 month ago

gdalle commented 1 month ago

Hey there friends of JSO!

@adrhill and I have been hard at work on two autodiff toolkits which you might find useful:

- SparseConnectivityTracer.jl, for fast Jacobian and Hessian sparsity pattern detection
- DifferentiationInterface.jl, a common interface to a wide range of autodiff backends

We would be happy to discuss potential uses, and integration with your ecosystem. We're having similar talks with Optimization.jl and the SciML organization at large.

Ping @amontoison @abelsiqueira since we've already chatted a few times :)

amontoison commented 1 month ago

@gdalle I propose to start by replacing SparseDiffTools.jl and Symbolics.jl with SparseConnectivityTracer.jl.

We just need to update sparse_sym.jl and sparse_diff_tools.jl.
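
Something along these lines could be a starting point (just a sketch with placeholder names, not the final code for those files):

using SparseConnectivityTracer

# placeholder helpers showing the intended replacement
compute_jacobian_sparsity(c!, cx, x0) = SparseConnectivityTracer.jacobian_pattern(c!, cx, x0)
compute_hessian_sparsity(f, x0) = SparseConnectivityTracer.hessian_pattern(f, x0)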

gdalle commented 1 month ago

Sounds good! We're currently in the middle of a big refactor for SparseConnectivityTracer.jl, so it will have to wait a few days.

adrhill commented 1 month ago

The refactor has been merged and SparseConnectivityTracer v0.4.0 has just been tagged, so it's ready to be broken by you folks!

PRs adding @test_broken tests are more than welcome.
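
For example, a currently unsupported case could be recorded like this (sketch; it uses the dot-based Lagrangian that comes up later in this thread as the broken case):

using Test, LinearAlgebra, SparseConnectivityTracer

y = rand(2)
f(x) = 2 * x[1] * x[2]^2
c(x) = [x[3]^3; x[3]^2]
ℓ(x) = f(x) + dot(c(x), y)

# recorded as broken until array rules such as `dot` are supported
@test_broken hessian_pattern(ℓ, rand(3)) isa AbstractMatrix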

amontoison commented 1 month ago

@adrhill @gdalle I started interfacing SparseConnectivityTracer.jl quickly tonight and I have some feedback:

MWE:

using SparseConnectivityTracer
using LinearAlgebra  # for `dot` below

y = zeros(3)
x = rand(3)

function J(y, x)
  y[1] = x[1]^2
  y[2] = 2 * x[1] * x[2]^2
  y[3] = sin(x[3])
  return y
end

S = jacobian_pattern(J, y, x)

x = rand(3)
y = rand(2)

f(x) = 2 * x[1] * x[2]^2
c(x) = [x[3]^3; x[3]^2]
S = hessian_pattern(f, x)

ℓ(x) = f(x) + dot(c(x), y)
S = hessian_pattern(ℓ, x)
ERROR: MethodError: no method matching conj(::SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}})

Closest candidates are:
  conj(::Missing)
   @ Base missing.jl:101
  conj(::Complex)
   @ Base complex.jl:282
  conj(::SymTridiagonal)
   @ LinearAlgebra ~/julia/julia-1.10.3/share/julia/stdlib/v1.10/LinearAlgebra/src/tridiag.jl:160
  ...

Stacktrace:
 [1] dot(x::SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}}, y::Float64)
   @ LinearAlgebra ~/julia/julia-1.10.3/share/julia/stdlib/v1.10/LinearAlgebra/src/generic.jl:876
 [2] dot(x::Vector{SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}}}, y::Vector{Float64})
   @ LinearAlgebra ~/julia/julia-1.10.3/share/julia/stdlib/v1.10/LinearAlgebra/src/generic.jl:886
 [3] ℓ(x::Vector{SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}}})
   @ Main ./REPL[41]:1
 [4] trace_function(::Type{SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{…}}}}, f::typeof(ℓ), x::Vector{Float64})
   @ SparseConnectivityTracer ~/.julia/packages/SparseConnectivityTracer/JvHcU/src/pattern.jl:32
 [5] hessian_pattern(f::Function, x::Vector{Float64}, ::Type{BitSet}, ::Type{Set{Tuple{Int64, Int64}}})
   @ SparseConnectivityTracer ~/.julia/packages/SparseConnectivityTracer/JvHcU/src/pattern.jl:326
 [6] top-level scope
   @ REPL[42]:1
Some type information was truncated. Use `show(err)` to see complete types.

If I also preallocate the storage of c for the Hessian of the augmented Lagrangian, SCT.jl returns an error because the element type of the preallocated storage for c (Float64) cannot hold the tracer numbers.

function compute_hessian_sparsity(f, nvar, c!, ncon)
  if ncon == 0
    x0 = rand(nvar)
    S = SparseConnectivityTracer.hessian_pattern(f, x0)
    return S
  else
    x0 = rand(nvar)
    y0 = rand(ncon)
    cx = similar(y0)
    ℓ(x) = f(x) + sum(c!(cx, x)[i] * y0[i] for i = 1:ncon)
    S = SparseConnectivityTracer.hessian_pattern(ℓ, x0)
    return S
  end
end

The sparse Jacobian tests then fail with:
Basic Jacobian derivative with backend=ADNLPModels.SparseADJacobian and T=Float64: Error During Test at /home/alexis/Bureau/git/ADNLPModels.jl/test/sparse_jacobian.jl:6
  Got exception outside of a @test
  TypeError: in typeassert, expected Float64, got a value of type SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}}
  Stacktrace:
    [1] setindex!(A::Vector{Float64}, x::SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}}, i1::Int64)
      @ Base ./array.jl:1021
    [2] (::var"#c!#21")(cx::Vector{Float64}, x::Vector{SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}}})
      @ Main ~/Bureau/git/ADNLPModels.jl/test/sparse_jacobian.jl:10
    [3] (::ADNLPModels.var"#35#37"{Vector{SparseConnectivityTracer.HessianTracer{BitSet, Set{Tuple{Int64, Int64}}}}, var"#c!#21", Vector{Float64}, Vector{Float64}})(i::Int64)

Maybe the best solution is to not provide c in-place?
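
For example, an out-of-place variant could look like this (just a sketch, not the final ADNLPModels code):

using SparseConnectivityTracer

function compute_hessian_sparsity_oop(f, nvar, c, ncon)
  x0 = rand(nvar)
  if ncon == 0
    return SparseConnectivityTracer.hessian_pattern(f, x0)
  else
    y0 = rand(ncon)
    # c(x) allocates its own output, so its eltype automatically matches the tracers
    ℓ(x) = f(x) + sum(c(x)[i] * y0[i] for i = 1:ncon)
    return SparseConnectivityTracer.hessian_pattern(ℓ, x0)
  end
end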

tmigot commented 1 month ago

Hey @gdalle @adrhill ! I love the idea and agree with @amontoison the sparse Jacobian/Hessian is a good starting point.

For DifferentiationInterface.jl, the best would be if you could provide an example that does the following:

T = Float64
n = 2
x0 = rand(T, n)
f(x) = sum(x) # or something
function c!(cx, x)
  cx[1] = x[1]^2 + x[2]^2 + x[1] * x[2]
  return cx
end

# a) compute gradient in-place
gx = similar(x0)
# ... ∇f(x)

# b) Hessian-vector product in-place
v = ones(2)
# ... ∇²f(x) * v

# c) Jacobian-vector product
# ∇c!(cx, x) * v

# d) transpose Jacobian-vector product
# ∇c!(cx, x)ᵀ * v

# e) Lagrangian Hessian-vector product
# ... (∇²f(x) + Σᵢ yᵢ ∇²cᵢ(x)) * v

Usually, we store everything needed to compute the derivatives in a backend that is initialized with the model, under the assumption that the derivatives will be computed multiple times.
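
To make that concrete, here is a minimal sketch of the "build once, reuse at every iteration" pattern with DifferentiationInterface's preparation mechanism (hypothetical names; see gdalle's fuller example below):

using DifferentiationInterface
import ForwardDiff

struct PreparedGradient{B, E}
  backend::B
  extras::E  # preparation object, built once from the model's starting point
end

function prepare_gradient_backend(f, x0)
  backend = AutoForwardDiff()
  extras = prepare_gradient(f, backend, x0)
  return PreparedGradient(backend, extras)
end

# reused at every solver iteration
grad!(f, g, p::PreparedGradient, x) = gradient!(f, g, p.backend, x, p.extras)

# usage sketch
f(x) = sum(abs2, x)
x0 = rand(2)
p = prepare_gradient_backend(f, x0)
g = similar(x0)
grad!(f, g, p, x0)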

gdalle commented 1 month ago

Here are some answers to your questions:

> Is SparseConnectivityTracer.jacobian_pattern(c!, cx, x0) more efficient than SparseConnectivityTracer.jacobian_pattern(c, x0)? Note that SparseConnectivityTracer.jacobian_pattern(c!, y, x) is not documented.

The rationale behind it is that f!(y, x) is more efficient than y = f(x) for array-to-array functions. Thus, some functions are implemented in-place, and we need to support them in Jacobian sparsity detection. However, the actual gain of efficiency within the sparsity detection is minimal, since lots of allocations occur anyway due to set manipulations.
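
For illustration, both forms detect the same pattern (sketch; the three-argument in-place form is the one used in your MWE above and is currently undocumented):

using SparseConnectivityTracer

x = rand(3)

# out-of-place constraint
c(x) = [x[1]^2; 2 * x[1] * x[2]^2; sin(x[3])]
S1 = jacobian_pattern(c, x)

# in-place constraint
function c!(y, x)
  y[1] = x[1]^2
  y[2] = 2 * x[1] * x[2]^2
  y[3] = sin(x[3])
  return y
end
S2 = jacobian_pattern(c!, zeros(3), x)

S1 == S2  # both forms should detect the same pattern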

> Is it possible to add an option to only return one triangle of the sparsity pattern of the Hessian? You could add a keyword argument with possible values F (default), L, or U. Can we exploit the symmetry of the Hessian to determine it more efficiently?

At the moment we do not exploit the symmetry of the Hessian. Internally, the Hessian pattern is represented as a Set{Tuple{Int,Int}}, and we make no effort to store only one triangle. It's "only" a factor-2 efficiency gain, so it's clearly something we would like to do, but it hasn't been implemented yet.
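
In the meantime, one workaround is to post-process the full pattern on the caller's side (sketch, assuming hessian_pattern returns a sparse Bool matrix as in your MWE above):

using SparseConnectivityTracer, LinearAlgebra

f(x) = 2 * x[1] * x[2]^2
S = hessian_pattern(f, rand(3))  # full symmetric pattern
L = tril(S)                      # keep only the lower triangle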

> In optimization, when we have constraints, we want the Hessian of the augmented Lagrangian, which means we want the sparsity structure of the Hessian of ℓ(x) = f(x) + dot(c(x), y). I get an error because SparseConnectivityTracer doesn't like dot. I can hack around it with a sum.

Essentially we need to add some array rules to SparseConnectivityTracer, and dot will be one of the very first.
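
For reference, the sum-based workaround you mention looks like this (sketch):

using SparseConnectivityTracer

y = rand(2)
f(x) = 2 * x[1] * x[2]^2
c(x) = [x[3]^3; x[3]^2]

# avoid `dot` until an array rule exists for it
function ℓ(x)
  cx = c(x)
  return f(x) + sum(cx[i] * y[i] for i in eachindex(y))
end

S = hessian_pattern(ℓ, rand(3))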

> In optimization, when we have constraints, we want the Hessian of the augmented Lagrangian, which means we want the sparsity structure of the Hessian of ℓ(x) = f(x) + dot(c(x), y). Can you do better if we provide f and c instead of ℓ directly as input? Can we compute the sparsity structure of the Jacobian of c(x) and the Hessian of the augmented Lagrangian ℓ(x) = f(x) + dot(c(x), y) at the same time to get a speed-up?

For SparseConnectivityTracer you can always detect the sparsity pattern of f and c separately, then combine them. But of course it won't be done automatically by the package, so you need to assemble the parts manually.
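
Concretely, a manual combination could look like this (sketch, assuming the patterns are sparse Bool matrices; the pattern of sum(c(x)) is used as a conservative proxy for the pattern of yᵀc(x) with generic multipliers):

using SparseConnectivityTracer

f(x) = 2 * x[1] * x[2]^2
c(x) = [x[3]^3; x[3]^2]
x0 = rand(3)

Hf = hessian_pattern(f, x0)               # Hessian pattern of the objective
Hc = hessian_pattern(x -> sum(c(x)), x0)  # conservative proxy for the pattern of yᵀc(x)
Jc = jacobian_pattern(c, x0)              # Jacobian pattern of the constraints

Hℓ = Hf .| Hc  # combined Hessian pattern of ℓ(x) = f(x) + dot(c(x), y)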

For DifferentiationInterface, @timholy asked a similar question a week ago and had some suggestions.

Feel free to weigh in on those discussions.

> If I also preallocate the storage of c for the Hessian of the augmented Lagrangian, SCT.jl returns an error because the element type of the preallocated storage for c (Float64) cannot hold the tracer numbers. Maybe the best solution is to not provide c in-place?

Don't provide c in-place, because SparseConnectivityTracer uses custom number types and operator overloading (like ForwardDiff).
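
If keeping c! in-place turns out to be important, one standard trick with operator-overloading tools like ForwardDiff is to allocate the buffer from eltype(x); it should also apply to our tracers, although this is only a sketch:

using SparseConnectivityTracer

# allocate the constraint buffer from the eltype of x, so it can hold
# Float64 as well as tracer numbers
function lagrangian_for_tracing(f, c!, y0, ncon)
  return x -> begin
    cx = similar(x, ncon)
    c!(cx, x)
    f(x) + sum(cx[i] * y0[i] for i = 1:ncon)
  end
end

# usage sketch
f(x) = 2 * x[1] * x[2]^2
c!(cx, x) = (cx[1] = x[3]^3; cx[2] = x[3]^2; cx)
ℓ = lagrangian_for_tracing(f, c!, rand(2), 2)
S = hessian_pattern(ℓ, rand(3))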

gdalle commented 1 month ago

@tmigot here's an example doing everything you asked. The mantra of DifferentiationInterface is "prepare once, reuse multiple times", so it fits well with an optimization point of view. Note that although the differentiation overwrites the storage you provide, we cannot guarantee that other allocations won't occur.

using DifferentiationInterface
import Enzyme, ForwardDiff, Zygote

forward_backend = AutoForwardDiff()
reverse_backend = AutoEnzyme(Enzyme.Reverse)  # supports mutation
second_order_backend = SecondOrder(AutoForwardDiff(), AutoZygote())

T = Float64
n = 2  # variables
m = 1  # constraints
x0 = rand(T, n)
v0 = rand(T, n)  # seed vector for the various products below

obj(x) = sum(x)

function cons!(c, x)
    c[1] = sum(abs2, x) - one(eltype(x))
    return nothing
end

# a) gradient

extras_a = prepare_gradient(obj, reverse_backend, x0)
g0 = similar(x0)
gradient!(obj, g0, reverse_backend, x0, extras_a)

# b) Hessian-vector product

extras_b = prepare_hvp(obj, second_order_backend, x0, v0)
p0 = similar(x0)
hvp!(obj, p0, second_order_backend, x0, v0, extras_b)

# c) Jacobian-vector product

c0 = ones(T, m)
extras_c = prepare_pushforward(cons!, c0, forward_backend, x0, v0)
p0 = similar(c0)
pushforward!(cons!, c0, p0, forward_backend, x0, v0, extras_c)

# d) transpose Jacobian-vector product

extras_d = prepare_pullback(cons!, c0, reverse_backend, x0, v0)
p0 = similar(x0)
pullback!(cons!, c0, p0, reverse_backend, x0, v0, extras_d)

# e) Lagrangian Hessian-vector product

# not sure what you mean by this one, can either be obtained by summing the HVPs of `f` and `ci` or by taking the HVP of the Lagrangian as a whole

I think, as above, the main point of discussion will be handling the Lagrangian aspect.
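
For completeness, here is one possible reading of (e), taking the HVP of the Lagrangian as a whole with fixed multipliers (sketch only; cons is an out-of-place variant of cons! introduced just for this example):

using DifferentiationInterface
import ForwardDiff, Zygote

second_order_backend = SecondOrder(AutoForwardDiff(), AutoZygote())

T = Float64
n, m = 2, 1
x0 = rand(T, n)
v0 = rand(T, n)
y = ones(T, m)  # fixed multipliers

obj(x) = sum(x)
cons(x) = [sum(abs2, x) - one(eltype(x))]
lag(x) = obj(x) + sum(y .* cons(x))

extras_e = prepare_hvp(lag, second_order_backend, x0, v0)
p0 = similar(x0)
hvp!(lag, p0, second_order_backend, x0, v0, extras_e)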

gdalle commented 2 days ago

@amontoison as of v0.5.6 (released this morning), DifferentiationInterface supports vector-mode pushforwards, pullbacks and HVPs, which makes Jacobians and Hessians much faster. It only works with ForwardDiff for now, but I want to add Enzyme soon. That was the last ingredient necessary for DI to be feature-complete, at least for one-argument functions. From now on, if something doesn't work well, it will likely be due to a missing or suboptimal implementation, but not to a hole in the design. I think we should start discussing its integration in ADNLPModels. The hardest remaining issue will be handling the Lagrangian, which takes 2 inputs ($x$ and $\lambda$), but I would very much benefit from hashing out possible solutions with you.
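
One direction we could explore, just to start the discussion: close over a multiplier vector that the solver updates in place, so the Lagrangian looks like a one-argument function (hypothetical sketch; whether the preparation stays valid when the captured multipliers change is exactly the kind of question to hash out):

using DifferentiationInterface
import ForwardDiff

obj(x) = sum(x)
cons(x) = [sum(abs2, x) - one(eltype(x))]

λ = ones(1)  # updated in place by the solver between iterations
lag(x) = obj(x) + sum(λ .* cons(x))

backend = AutoForwardDiff()
x0 = rand(2)
extras = prepare_gradient(lag, backend, x0)
g = similar(x0)
gradient!(lag, g, backend, x0, extras)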