JuliaManifolds / Manopt.jl

🏔️ Manopt.jl – Optimization on Manifolds in Julia
http://manoptjl.org

Add Warning if gradient larger than injectivity radius. #346

Closed · mateuszbaran closed this 7 months ago

mateuszbaran commented 7 months ago

This example generates an incorrect descent direction in L-BFGS:

using Manopt, Manifolds, LineSearches

# Rosenbrock-type cost, evaluated on the embedding coordinates of the point
function f_rosenbrock_manopt(::AbstractManifold, x)
    result = 0.0
    for i in 1:2:length(x)
        result += (1.0 - x[i])^2 + 100.0 * (x[i + 1] - x[i]^2)^2
    end
    return result
end

# Euclidean Rosenbrock gradient, converted in place to the Riemannian gradient
function g_rosenbrock_manopt!(M::AbstractManifold, storage, x)
    for i in 1:2:length(x)
        storage[i] = -2.0 * (1.0 - x[i]) - 400.0 * (x[i + 1] - x[i]^2) * x[i]
        storage[i + 1] = 200.0 * (x[i + 1] - x[i]^2)
    end
    riemannian_gradient!(M, storage, x, storage)  # Euclidean -> Riemannian gradient in place
    return storage
end

function test_case_manopt()
    N = 128
    mem_len = 1
    M = Manifolds.Sphere(N - 1)
    ls_hz = LineSearches.HagerZhang{Float64}(alphamax=3.0)

    x0 = zeros(N)
    x0[1] = 1
    manopt_sc = StopWhenGradientNormLess(1e-6) | StopAfterIteration(1000)

    return quasi_Newton(
        M,
        f_rosenbrock_manopt,
        g_rosenbrock_manopt!,
        x0;
        stepsize=Manopt.LineSearchesStepsize(ls_hz),
        evaluation=InplaceEvaluation(),
        vector_transport_method=ParallelTransport(),
        return_state=true,
        memory_size=mem_len,
        stopping_criterion=manopt_sc,
    )
end

test_case_manopt()

EDIT:

Second example; the longest gradient encountered has a norm of about 0.12:

using Revise
using Manopt, Manifolds, LineSearches
using Random, ManoptExamples, LinearAlgebra

Random.seed!(30)

N = 1024

# random symmetric matrix; Rayleigh quotient cost and gradient from ManoptExamples
A = Symmetric(2 * randn(N, N) / N)
f_manopt = ManoptExamples.RayleighQuotientCost(A)
g_manopt! = ManoptExamples.RayleighQuotientGrad!!(A)

function test_case_manopt()
    mem_len = 2
    M = Manifolds.Sphere(N - 1)
    ls_hz = LineSearches.HagerZhang{Float64}(alphamax=3.0)

    x0 = zeros(N)
    x0[1] = 1
    manopt_sc = StopWhenGradientNormLess(1e-6) | StopAfterIteration(1000)

    return quasi_Newton(
        M,
        f_manopt,
        g_manopt!,
        x0;
        stepsize=Manopt.LineSearchesStepsize(ls_hz),
        evaluation=InplaceEvaluation(),
        vector_transport_method=ParallelTransport(),
        return_state=true,
        memory_size=mem_len,
        stopping_criterion=manopt_sc,
    )
end

test_case_manopt()

Yet it still fails.

kellertuer commented 7 months ago

I first investigated when this appears:

1) N=34 (at iteration 2), 40 (5), 42 (4), 98 (2), 110 (3), 112 to 132 (3)
2) For ProjectionTransport(), for example, it also appears for N=24 (4)
3) It never appears for the Euclidean manifold
4) Shortening the gradient often helps

Then I found the culprit: it is the gradient. For the case described here, the (Riemannian) gradient in the very first step has a norm of 200.62901086333446.

With any norm larger than 2π I would call it a game of chance whether any line search works, since we have already left the injectivity radius halfway to there. With the norm you get on the 128-sphere, it is essentially random where on the great arc in that direction you end up on the sphere, and then I am surprised it does not fail much more often ;)

So I am quite convinced the overly large gradient norm is the reason here, and that it is more of an inherent manifold problem than a bug in Manopt.jl.
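
To make this concrete, here is a quick check, reusing f_rosenbrock_manopt/g_rosenbrock_manopt! and the packages loaded in the first example above, of how long that first Riemannian gradient really is compared to the injectivity radius of the sphere:

```julia
using LinearAlgebra

M = Manifolds.Sphere(127)          # the N = 128 case from the first example
x0 = zeros(128); x0[1] = 1.0
grad = similar(x0)
g_rosenbrock_manopt!(M, grad, x0)  # Riemannian gradient at the start point
norm(M, x0, grad)                   # ≈ 200.629, the value reported above
injectivity_radius(M)               # π ≈ 3.1416
```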

mateuszbaran commented 7 months ago

Well, then what should I do to perform optimization using L-BFGS? Or is L-BFGS incompatible with such long gradients? Maybe there should be an entry in the docs about this?

kellertuer commented 7 months ago

Any optimization algorithm on the sphere is incompatible with gradients longer than twice the injectivity radius. If you have a gradient of length exactly 2π and you accept a step size of 1, you stay exactly where you are. If you start a line search around that step size 1, you are just line-searching around your start point.
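
A minimal illustration of that statement on the 2-sphere (assuming Manifolds.jl as in the examples above): following a tangent vector of length exactly 2π along the geodesic brings you back to the starting point, so accepting such a step is a no-op.

```julia
using Manifolds

M = Sphere(2)
p = [1.0, 0.0, 0.0]
X = [0.0, 2π, 0.0]             # tangent vector at p with norm 2π
q = exp(M, p, X)               # walk the full great circle once
isapprox(M, p, q; atol=1e-12)  # true, up to floating-point error
```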

None of this ever makes sense, independent of how you obtained the search direction (gradient descent, CG, QN, all of them).

If one were to add something like that to the docs, it would have to be somewhere more general than at a specific solver; my understanding for now was that outside of the injectivity radius one should probably not look for a next iterate. Outside of twice the injectivity radius, most things stop making sense at all.

One thing one could add is a DebugWarnifNormTooLarge warning, which would just be a specific case of the warnings we already have. But then again, I would maybe only perform these tests in “beginner variants”?

mateuszbaran commented 7 months ago

We can always rescale the problem: $f(x) \rightarrow \epsilon f(x)$, $\nabla f(x) \rightarrow \epsilon \nabla f(x)$ for a small positive $\epsilon$; then there are no excessively long gradients and the minimizer is still the same. Gradient descent could do this automatically; CG and QN would have to do a restart. I'm not suggesting we do that, I just mention that it would be possible.
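
A sketch of that rescaling applied to the first example above; the names f_scaled, g_scaled! and the value of ε are just for illustration, not a proposal for Manopt itself:

```julia
ε = 1e-3

f_scaled(M, x) = ε * f_rosenbrock_manopt(M, x)

function g_scaled!(M, storage, x)
    g_rosenbrock_manopt!(M, storage, x)
    storage .*= ε          # ∇(ε f) = ε ∇f, so cost and gradient stay consistent
    return storage
end

# These could then replace f_rosenbrock_manopt / g_rosenbrock_manopt! in the
# quasi_Newton call; the initial gradient norm drops from ≈ 200.6 to ≈ 0.2.
```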

> One thing one could add is a DebugWarnifNormTooLarge warning, which would just be a specific case of the warnings we already have. But then again, I would maybe only perform these tests in “beginner variants”?

That would be useful I think. It would make debugging easier.

kellertuer commented 7 months ago

We have a few cases (ARC) where we by default have a debug and a warning.

Something like that we could add for “gradient type” methods. The only “disadvantage” would be that the solver is then wrapped in a DebugSolver, and that might cost a tiny bit of performance.

The warning could even recommend the rescaling, and report both the injectivity radius and the gradient norm, so that one has a first idea of how much to rescale.
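
To make that concrete, here is a rough sketch (purely illustrative, not the implementation from the PR mentioned below) of what such a debug action could look like; the type name and the exact state accessors used are assumptions:

```julia
using Manopt, Manifolds, LinearAlgebra

# Hypothetical debug action: warn whenever the current (Riemannian) gradient is
# longer than the injectivity radius at the current iterate.
struct DebugWarnIfGradientTooLong <: DebugAction end

function (d::DebugWarnIfGradientTooLong)(
    mp::AbstractManoptProblem, st::AbstractManoptSolverState, k::Int
)
    M = get_manifold(mp)
    p = get_iterate(st)
    X = get_gradient(st)            # assumes a gradient-based solver state
    gn = norm(M, p, X)
    r = injectivity_radius(M, p)
    if gn > r
        @warn "Iteration $k: gradient norm $gn exceeds the injectivity radius $r; consider rescaling the cost."
    end
    return nothing
end
```

Such an action could then (hypothetically) be activated by passing debug=[DebugWarnIfGradientTooLong()] to a solver such as quasi_Newton.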

mateuszbaran commented 7 months ago

I'm not sure whether such a debug mode should be on by default, but it would be helpful to have it.

kellertuer commented 7 months ago

Let's reopen this. I will think about how to best add that, maybe even with an option to (globally) disable this.

One idea would be to have a global option, set via set_manopt_parameter, to activate a few more warnings. When loading Manopt this could be active (but one could turn it off for performance). That way we assume a bit of a beginner mode by default.

With this mode set, we would add a debug=[...] entry to activate the gradient-length warning.

I will try to do this soon-ish – it does not sound too complicated to add.

kellertuer commented 7 months ago

I am also not so sure it should be on by default; for beginners it might be nice. That is why I am thinking about a global switch (something like Manopt.BEGINNERMODE, but with a better name).

That switch would set a keyword argument beginner_mode in the single solvers, and if that is set, the debug would default to the warning.

That way:
1) Beginners have a nice mode in general, and experts can change it
2) Experts can easily use it for single solvers as well
3) One just has to set something to true or 1 or so (and does not have to remember the actual debug=... stuff)

But sure, we need a good name, and a decision on whether this should be on or off by default.

mateuszbaran commented 7 months ago

Maybe we could use https://github.com/JuliaPackaging/Preferences.jl to store this setting?
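
A rough sketch, not the actual Manopt.jl implementation, of how Preferences.jl could back such a switch. The preference key "mode", the module name, and the function names are hypothetical; Preferences.jl requires the calling module to belong to a package with a UUID, so this would have to live inside Manopt itself:

```julia
module TutorialModeSketch

using Preferences

# Read the stored preference once at load time; default to the verbose mode.
const MODE = @load_preference("mode", "Tutorial")

is_tutorial_mode() = MODE == "Tutorial"

# Persist a new value; it takes effect after the package is reloaded.
function set_mode!(mode::String)
    @set_preferences!("mode" => mode)
    @info "Mode set to \"$mode\"; restart Julia for the change to take effect."
end

end # module
```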

kellertuer commented 7 months ago

Concerning the updated code above (oh, we are now mixing two things here): I can check what is best to do there; I would first recommend maybe using retractions and vector transports, since they are numerically more stable (though perhaps mathematically a bit imprecise).

edit: For the updated one that fails to find a direction, I have no idea. It seems fairly independent of

On the other hand, it works fine if I, for example, set alphamax to 1.0 (I have no clue what that does, a maximal scale factor?).

It also seems to work fine for nearly all spheres I chose with dimension smaller than the value you report, and the failure only happens after quite a few steps.

With alphamax = 1 it even works for N=2^13 or N=2^14 (16k); beyond that my memory slowly gives up on generating A (but it works both for exp/log/PT as well as the ProjectionX family). So maybe it is this magic alphamax?

edit: Even without passing alphamax I cannot find a case where QN fails. If that is some kind of maximal step length, it should of course be capped at the injectivity radius (then 3 would be too large); if it is more like a scaling of the gradient, I am not sure; it is by no means documented. One thing I did notice is that in a lot of cases alphamax is the step size actually used, which might indeed again lead to strange effects if it is too large, but from HagerZhang I cannot understand the effect (since it is really not documented at all).

mateuszbaran commented 7 months ago

alphamax is the largest step size the line search will consider. 3.0 is by no means a good choice; it's just a crude patch for the Hager-Zhang line search from LineSearches.jl.
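
One experiment this suggests (purely a suggestion, not something from the code in this thread) would be to cap alphamax at the injectivity radius instead of the arbitrary 3.0. Note that alphamax bounds the scalar step α, so this only bounds the actual movement on the sphere when the search direction has roughly unit norm:

```julia
using Manifolds, LineSearches

M = Manifolds.Sphere(1023)  # the N = 1024 case from the second example
# Cap the largest scalar step the Hager-Zhang line search may try at π,
# the injectivity radius of the sphere, instead of the arbitrary 3.0.
ls_hz = LineSearches.HagerZhang{Float64}(alphamax=Float64(injectivity_radius(M)))
```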

The issue is quite rare. I discovered it originally for my improved HZ linesearch and then had to search for a matrix that gives short gradients but causes a wrong search direction.

kellertuer commented 7 months ago

Preferences.jl works perfectly fine locally already, I just need a good name for the option now ;)

mateuszbaran commented 7 months ago

HELPFUL_MODE sounds fine to me, but if you have a better idea, go for it. Or you could ask for name suggestions on some channel on Slack.

kellertuer commented 7 months ago

I was thinking of maybe a "Mode" setting that can be set to "Tutorial", where one could in general provide a few intro-style help warnings here and there.

kellertuer commented 7 months ago

That avoids both "helpful" (I am not sure whether it is helpful) and "beginner" (it might also be helpful for non-beginners).

"Tutorial" is a bit like when you start a game and get introduced to the gameplay and the world :)

mateuszbaran commented 7 months ago

Tutorial mode is a good name, but I wouldn't stretch the game analogy too far (for example, Manopt could have achievements, but maybe it should not :wink:).

kellertuer commented 7 months ago

The PR resolving this should be finished :)

mateuszbaran commented 7 months ago

Cool! I will take a look.