gbaydin / hypergradient-descent

Hypergradient descent

negative learning rate? #1

Closed shaibagon closed 6 years ago

shaibagon commented 6 years ago

Very interesting work and thanks for sharing your implementation.

I was trying the SGDHD optimizer on a learning task, and while inspecting the actual learning rate (the value of optimizer.param_groups[0]['lr']) I noticed that it sometimes becomes negative (e.g., -9.65e-06). Although the value is very small, it is still very strange to see a negative learning rate. Shouldn't the code cap the value at zero? Is this normal behavior?

BTW, I started with lr: 0 and hypergrad_lr: 1e-8.
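For reference, this is roughly the kind of check I'm running (a minimal sketch with placeholder data and an assumed import path, not my actual code; the class name and the lr / hypergrad_lr arguments are the ones mentioned above):

```python
import torch
import torch.nn as nn

# The import path is an assumption; the class name and the lr /
# hypergrad_lr arguments are as used in this issue.
from sgd_hd import SGDHD

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = SGDHD(model.parameters(), lr=0.0, hypergrad_lr=1e-8)

# Placeholder data, just to make the loop self-contained.
x, y = torch.randn(64, 10), torch.randn(64, 1)

for step in range(1000):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

    # Inspect the online-adapted learning rate after each step.
    lr = optimizer.param_groups[0]['lr']
    if lr < 0:
        print(f"step {step}: negative learning rate {lr:.3e}")
        # Capping at zero would look like this:
        # optimizer.param_groups[0]['lr'] = 0.0
```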

Thanks! -Shai

gbaydin commented 6 years ago

Hi @shaibagon, a negative learning rate can (and does) occur in the late stages of training, when the learning rate has decayed to roughly zero and fluctuates around it. Because of this fluctuation, it takes both positive and negative values that are small in absolute value.
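For intuition, here is a minimal sketch of the SGD-HD update rule from the paper, α_t = α_{t-1} + β ∇f(θ_{t-1})·∇f(θ_{t-2}): the learning rate moves by the hypergradient learning rate times the dot product of consecutive gradients, so when those gradients are anti-correlated the dot product is negative and a near-zero learning rate can be pushed slightly below zero. The standalone function and the toy quadratic below are illustrative only, not the repo's optimizer code:

```python
import torch

def sgd_hd_step(theta, grad, prev_grad, alpha, beta):
    # Hypergradient update: alpha moves by beta times the dot product of
    # the current and previous gradients. When consecutive gradients are
    # anti-correlated, that dot product is negative and alpha decreases;
    # near convergence this can push a near-zero alpha slightly negative.
    alpha = alpha + beta * torch.dot(grad, prev_grad).item()
    theta = theta - alpha * grad
    return theta, alpha

# Toy quadratic f(theta) = 0.5 * ||theta||^2, so grad f(theta) = theta.
theta = torch.randn(5)
alpha, beta = 0.0, 1e-3
prev_grad = torch.zeros_like(theta)
for step in range(200):
    grad = theta.clone()
    theta, alpha = sgd_hd_step(theta, grad, prev_grad, alpha, beta)
    prev_grad = grad

print(alpha)  # the step size adapted online, starting from zero
```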

We've observed this in many problems and it does not seem to be destructive. We don't yet have a theoretical analysis of it, but I suspect that the learning rate converges to zero in expectation.

The conference poster also shares this empirical finding: https://github.com/gbaydin/hypergradient-descent/raw/master/poster/iclr_2018_poster.pdf

shaibagon commented 6 years ago

@gbaydin Thank you very much for clarifying this matter. Closing this as a non-issue.