gbaydin / hypergradient-descent

Hypergradient descent

Learning rate increases? #4

Open · myui opened this issue 5 years ago

myui commented 5 years ago

I have a question about the following part of the paper:

[Screenshot of the learning-rate update from the paper: α_t = α_{t-1} + β ∇f(θ_{t-1})·∇f(θ_{t-2})]

In the dot product ∇f(θ_{t-1})·∇f(θ_{t-2}), the signs of ∇f(θ_{t-1}) and ∇f(θ_{t-2}) are often the same.

Then the learning rate α_t would increase monotonically under the above equation whenever sign(∇f(θ_{t-1})) = sign(∇f(θ_{t-2})).

I assume the difference between the gradient at t-1 and the previous gradient at t-2 is usually small.

Am I missing something?
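
For concreteness, here is a minimal sketch of the additive update being discussed (plain SGD-HD as I read the equation above; the values of alpha0 and beta are illustrative, not taken from this repository):

```python
# Minimal sketch of the additive rule as I read it:
# alpha_t = alpha_{t-1} + beta * grad(theta_{t-1}) . grad(theta_{t-2}).
import numpy as np

def sgd_hd(grad_fn, theta, alpha0=1e-3, beta=1e-4, steps=100):
    alpha, prev_grad = alpha0, np.zeros_like(theta)
    for _ in range(steps):
        grad = grad_fn(theta)
        # While consecutive gradients point the same way, this dot product is
        # positive and alpha grows; that is the behaviour being asked about.
        alpha = alpha + beta * np.dot(grad, prev_grad)
        theta = theta - alpha * grad
        prev_grad = grad
    return theta, alpha
```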

akaniklaus commented 5 years ago

Well, I don't know about the equation, but in practice it first increases the learning rate if it is too low. @gbaydin I have also seen the learning rate become negative, especially when hypergrad_lr is high. Should we maybe place a constraint (e.g. clipping) to prevent that from happening?
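
A clipping constraint like the one suggested could be as simple as flooring the adapted rate (min_alpha below is a hypothetical knob, not a parameter of this repository):

```python
# Hypothetical floor on the adapted learning rate; min_alpha is illustrative.
import numpy as np

def clipped_hd_alpha(alpha, grad, prev_grad, beta, min_alpha=0.0):
    alpha = alpha + beta * np.dot(grad, prev_grad)  # additive hypergradient step
    return max(alpha, min_alpha)                    # never let alpha go below the floor
```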

myui commented 5 years ago

After modifying the algorithm described in the original paper, Adam-HD worked fine: https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/optimizer/Optimizer.java#L674

This thesis (multiplicative hypergradient descent) helped: https://github.com/damaru2/convergence_analysis_hypergradient_descent/blob/master/dissertation_hypergradients.pdf

Negative learning rates can be seen in the original experiments, but my understanding is that this is accepted. Some clipping might help, though.
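
For illustration only (the exact rule in the dissertation and in the Hivemall code may differ in detail), a multiplicative variant scales the learning rate by a positive factor instead of adding to it, so it cannot change sign:

```python
# Sketch of one possible multiplicative hypergradient update (illustrative;
# not necessarily the exact rule used in the dissertation or in Hivemall).
import numpy as np

def multiplicative_hd_alpha(alpha, grad, prev_grad, beta=0.02, eps=1e-12):
    # Cosine of the angle between consecutive gradients, in [-1, 1].
    cos = np.dot(grad, prev_grad) / (np.linalg.norm(grad) * np.linalg.norm(prev_grad) + eps)
    # With beta < 1 the factor stays in (1 - beta, 1 + beta), so alpha stays positive.
    return alpha * (1.0 + beta * cos)
```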

akaniklaus commented 5 years ago

@myui What do you mean by fine? What exactly was wrong with the version that is in this repository?

myui commented 5 years ago

@akaniklaus Under certain conditions the learning rate increased monotonically, because ∇f(θ_{t-1})·∇f(θ_{t-2}) is usually greater than 0.

gbaydin commented 5 years ago

@myui If you look at the results in the paper and in David Martinez's thesis, you can see that the algorithms, as they are formulated in the paper, can both increase and decrease the learning rate according to the loss landscape. I think your interpretation that a monotonically increasing learning rate would be observed is not correct. It is, however, correct that a small initial learning rate is most of the time increased (almost monotonically) up to some limit in the initial part of training. If you run training long enough, this is almost always followed by a decay (decrease) of the learning rate during the rest of training. The poster here gives a quick summary: https://github.com/gbaydin/hypergradient-descent/raw/master/poster/iclr_2018_poster.pdf

You can of course have your own modifications of this algorithm.
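
A quick way to see this increase-then-decay behaviour is a toy run of SGD-HD on a noisy 1-D quadratic; everything below (the objective, noise level, and hyperparameters) is an illustrative choice, not part of this repository:

```python
# Toy 1-D illustration (not from this repo): SGD-HD on f(theta) = 0.5 * theta**2
# with noisy gradients. All hyperparameters here are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
theta, alpha, beta, sigma = 5.0, 0.01, 1e-3, 0.5
prev_grad = 0.0
for t in range(1, 20001):
    grad = theta + sigma * rng.standard_normal()  # noisy gradient of 0.5 * theta**2
    alpha += beta * grad * prev_grad              # additive hypergradient update
    theta -= alpha * grad
    prev_grad = grad
    if t in (1, 50, 200, 1000, 5000, 20000):
        print(f"step {t:>6}  alpha = {alpha:.4f}")
# Far from the optimum consecutive gradients are aligned, so alpha grows;
# once theta hovers around the optimum the noise makes the dot product of
# consecutive gradients negative on average, and alpha slowly decays.
```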

gbaydin commented 5 years ago

Negative learning rates sometimes happen, and it's not as catastrophic as it first sounds. It just means that the algorithm decides to backtrack (do gradient ascent instead of descent) under some conditions. In my observation, negative learning rates happen in the late stages of training, where the learning rate has decayed towards a very low positive value and started to fluctuate around it. If the fluctuation is too strong, and if the decayed value is close to zero, this means that the learning rate sometimes becomes negative. I think this in effect means that the algorithm stays in the same region of the loss landscape because it has converged to a (local) optimum. My view is that it is valuable to reason about this behavior and pursue a theoretical understanding of its implications, rather than adding extra heuristics to "fix" or clip this behavior. I haven't had much time to explore this yet, but hope to do so in the near future.
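
Just to spell out the backtracking reading: with a negative alpha, the usual step theta - alpha * grad moves along +grad, i.e. uphill. A tiny check with made-up values:

```python
import numpy as np

theta, grad = np.array([1.0, -2.0]), np.array([0.5, -1.0])  # made-up point and gradient
print(theta - 0.1 * grad)     # alpha > 0: step against the gradient (descent)
print(theta - (-0.1) * grad)  # alpha < 0: step along the gradient (ascent / backtracking)
```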

myui commented 5 years ago

Backtracking makes sense.