Maximilian-Stefan-Ernst opened this issue 2 years ago
@Maximilian-Stefan-Ernst I'm not sure how Optim deals with `Inf` values. From what you say, I'm guessing you are trying to deal with constraints in some smooth term (having it return `Inf` when some condition is met?) on which algorithms will evaluate the gradient. I'm not sure how automatic differentiation deals with that situation, but I can see it may have issues with it (for a function to be differentiable at a point, it must be finite there).
Can you give an example of what you tried?
Yes! Okay, so our objective is of the form `x -> logdet(Σ(x)) + tr(inv(Σ(x)))`, where `x` is a real-valued vector of parameters and `Σ` is an intermediate matrix constructed from the parameters that has to be positive definite. The final objective is then a function of `Σ`, i.e. `logdet(Σ) + tr(inv(Σ))`.
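For reference, the gradient of this objective with respect to $\Sigma$ follows from standard matrix calculus (for symmetric $\Sigma$):

$$\mathrm{d}\,\log\det\Sigma = \operatorname{tr}\!\left(\Sigma^{-1}\,\mathrm{d}\Sigma\right), \qquad \mathrm{d}\,\operatorname{tr}\!\left(\Sigma^{-1}\right) = -\operatorname{tr}\!\left(\Sigma^{-2}\,\mathrm{d}\Sigma\right),$$

so

$$\nabla_\Sigma \left[\log\det\Sigma + \operatorname{tr}\!\left(\Sigma^{-1}\right)\right] = \Sigma^{-1} - \Sigma^{-2},$$

which vanishes at $\Sigma = I$; that is why the minimizer in the MWE below comes out at $[1.0, 1.0]$.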
So what works fine for us with Optim and NLopt is to check whether `Σ(x)` is positive definite and return `Inf` if it is not. An MWE using automatic differentiation looks like this:
```julia
using Zygote, Optim, ForwardDiff, ProximalAlgorithms, ProximalOperators, LinearAlgebra, ProximalCore

function f(x)
    # in reality, Σ is a more complicated function of the parameters
    Σ = [x[1] 0.0
         0.0  x[2]]
    # since Σ is diagonal, it is positive definite iff x[1] > 0 and x[2] > 0
    if !isposdef(Σ)
        return Inf
    else
        return logdet(Σ) + tr(inv(Σ))
    end
end
```
```julia
# works just fine
sol = Optim.optimize(f, [2.0, 2.0], BFGS(); autodiff = :forward)

# finds the correct values of [1.0; 1.0]
sol.minimizer

# the same with ProximalAlgorithms
ProximalAlgorithms.PANOC()(x0 = [2.0, 2.0], f = f, g = NormL1(0.0))
# ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float64

# the reason seems to be that different autodiff backends handle inadmissible values differently
ForwardDiff.gradient(f, [-1.0, -1.0])
# returns [0.0; 0.0]
Zygote.gradient(f, [-1.0, -1.0])
# returns (nothing,)
```
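As an aside, for this particular MWE one way to sidestep the `Inf` branch entirely (and keep every AD backend well defined) is to reparameterize so the constraint holds by construction. A minimal sketch, assuming the feasible set really is the positive orthant, which holds for the diagonal `Σ` here but not necessarily for the real model; `f_unconstrained` is a hypothetical name:

```julia
using Optim, ForwardDiff, LinearAlgebra

# Reparameterize x = exp.(y): y is unconstrained and x is positive by
# construction, so the inadmissible branch is never needed and the
# function stays smooth everywhere.
function f_unconstrained(y)
    Σ = Diagonal(exp.(y))   # diagonal and positive definite for any real y
    return logdet(Σ) + tr(inv(Σ))
end

sol = Optim.optimize(f_unconstrained, [1.0, 1.0], BFGS(); autodiff = :forward)
exp.(Optim.minimizer(sol))  # recovers x; should be close to [1.0, 1.0]
```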
In reality, for performance reasons, we supply analytical gradients instead. For inadmissible parameter values, we just fill the gradient with ones (i.e. anything that does not fulfill Optim's convergence criterion on the gradient norm).
However, this fails for ProximalAlgorithms:
```julia
struct myproblem end

function ProximalCore.gradient!(grad, f::myproblem, x)
    Σ = [x[1] 0.0
         0.0  x[2]]
    if !isposdef(Σ)
        grad .= 1.0
        return Inf
    else
        # Jacobian of vec(Σ) with respect to x
        ∇Σ = [1.0 0.0
              0.0 0.0
              0.0 0.0
              0.0 1.0]
        grad' .= vec(inv(Σ) - inv(Σ)^2)' * ∇Σ
        return logdet(Σ) + tr(inv(Σ))
    end
end

ProximalAlgorithms.PANOC()(x0 = [2.0, 2.0], f = myproblem(), g = NormL1(0.0))
# output: ([-6.39742852169722e17, -6.39742852169722e17], 11)
```
But it starts to work again if we return `floatmax(...)` instead of `Inf`:
```julia
function ProximalCore.gradient!(grad, f::myproblem, x)
    Σ = [x[1] 0.0
         0.0  x[2]]
    if !isposdef(Σ)
        grad .= 1.0
        return floatmax(eltype(x)) # <- only this line changed
    else
        ∇Σ = [1.0 0.0
              0.0 0.0
              0.0 0.0
              0.0 1.0]
        grad' .= vec(inv(Σ) - inv(Σ)^2)' * ∇Σ
        return logdet(Σ) + tr(inv(Σ))
    end
end

ProximalAlgorithms.PANOC()(x0 = [2.0, 2.0], f = myproblem(), g = NormL1(0.0))
# output: ([0.9999999997915161, 0.9999999997915161], 7)
```
Sorry for the long post, but I could not find an easier MWE that reproduces the problem.
Not at all, thanks for the detailed report! Leaving aside the AD issues, `PANOC` does a line search internally, which I think may severely break in case the function evaluates to `Inf`; there's a pretty strong requirement on `f` being real-valued (and actually smooth) to guarantee convergence.
The `PANOCplus` variant may work better in case `f` is simply differentiable (but then again, the trick being used here really makes it non-differentiable). Otherwise `ForwardBackward` is free from line searches, in case you manually provide a step size, but then this should be small enough compared to the Lipschitz constant of the gradient of `f`.
To summarize, I'm not surprised if things break in your case 🙂 and now I'm curious how Optim deals with it.
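One low-level reason a finite sentinel like `floatmax` can behave better than `Inf` in any method that does arithmetic on objective values (differences, quadratic models, averages) is ordinary floating-point semantics; this is a generic observation, not a claim about PANOC's internals:

```julia
# Differences of infinite values are NaN, and every comparison with NaN
# is false, so any decrease test of the form f_old - f_new < tol can
# silently misbehave once f returns Inf.
Inf - Inf                              # NaN
0.0 * Inf                              # NaN
NaN < 1.0                              # false
# Differences of finite sentinels stay meaningful:
floatmax(Float64) - floatmax(Float64)  # 0.0
```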
If you don't want questions in your issues, please just delete.
We have complicated constraints that cannot (or at least it is not obvious how to) be enforced via a proximal operator. However, with Optim.jl we've had good experiences just returning `Inf` whenever the constraints are not met (and there are some theoretical reasons why this should work, as well as a lot of practical experience). Using ProximalAlgorithms, this strategy leads to convergence problems. Do you know if ProximalAlgorithms handles `Inf` values differently than Optim? Is there any advantage of returning `floatmax(...)` or a very high number instead of `Inf`?