ekmett / ad

Automatic Differentiation
http://hackage.haskell.org/package/ad
BSD 3-Clause "New" or "Revised" License

Gradient descent fails to converge (to the right result). #45

Open echatav opened 9 years ago

echatav commented 9 years ago

Consider the function,

let f [m,b] = (6020.272727*m+b-23680.30303)^(2::Int) + (7254.196429*m+b-28807.61607)^(2::Int) + (6738.575342*m+b-26582.76712)^(2::Int) + (5464.658537*m+b-23894.34756)^(2::Int)

It's a smooth convex function and gradient descent should converge to [m,b] = [2.911525576,7196.512447]. See Wolfram Alpha.
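As a sanity check on the claimed minimizer: f is exactly the least-squares objective for fitting a line y = m*x + b to four data points, so its argmin solves the 2x2 normal equations directly. A minimal standalone sketch in plain Haskell, no extra packages (`fitLine` is my own name):

```haskell
-- The four data points implicit in f: each squared term is (x_i*m + b - y_i)^2.
xs, ys :: [Double]
xs = [6020.272727, 7254.196429, 6738.575342, 5464.658537]
ys = [23680.30303, 28807.61607, 26582.76712, 23894.34756]

-- Minimizing (m, b) via Cramer's rule on the 2x2 normal equations:
--   [sum x^2  sum x] [m]   [sum x*y]
--   [sum x    n    ] [b] = [sum y  ]
fitLine :: (Double, Double)
fitLine = (m, b)
  where
    n   = fromIntegral (length xs)
    sx  = sum xs
    sy  = sum ys
    sxx = sum (zipWith (*) xs xs)
    sxy = sum (zipWith (*) xs ys)
    det = sxx * n  - sx * sx
    m   = (sxy * n  - sx * sy ) / det
    b   = (sxx * sy - sx * sxy) / det

main :: IO ()
main = print fitLine  -- should agree with the Wolfram Alpha figures above
```

The result should match [2.911525576, 7196.512447] to within floating-point error.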

Instead,

(!! 10) $ gradientDescent f [0,0]

[6.503543385615585,1.0522541316691529e-3]

(!! 100) $ gradientDescent f [0,0]

[4.074543700217947,1.0251230667415617e-3]

(!! 1000) $ gradientDescent f [0,0]

[4.028579379271611,4.621971888645496e-3]

(!! 10000) $ gradientDescent f [0,0]

[4.02857388945333,4.100383031651195e-2]

barak commented 9 years ago

Do you have any reason to think that it isn't converging to the minimum? What you show is that in 10000 iterations it hasn't made much progress. But it does seem to be going in the right direction, albeit very slowly!

Seriously, this gradientDescent code was not intended for serious use: it was written as a placeholder and is completely untuned, designed to be robust but potentially very slow. The conjugateGradientDescent routine looks to be in better shape (although its line search does 20 Newton steps instead of the more appropriate single Newton step, and there is no code to check that the objective is changing in the right direction, no trust regions, or anything like that) and, barring bugs, it should converge in just a handful of line searches for a two-dimensional quadratic objective function like this one.
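On the line-search point: for a quadratic objective the single Newton step barak describes amounts to an exact line search, because the minimizer of t |-> f (x + t*d) is t = -(g . d) / (d . H d), with g the gradient and H the Hessian at x. A hedged sketch using the library's grad and hessian (`dot` and `exactLineStep` are my own names, not part of ad):

```haskell
import Numeric.AD (grad, hessian)

-- The objective from the issue, kept polymorphic so grad and hessian
-- can instantiate it at their own AD number types.
f :: Fractional a => [a] -> a
f [m,b] = (6020.272727*m + b - 23680.30303)^(2::Int)
        + (7254.196429*m + b - 28807.61607)^(2::Int)
        + (6738.575342*m + b - 26582.76712)^(2::Int)
        + (5464.658537*m + b - 23894.34756)^(2::Int)
f _ = error "f expects [m,b]"

dot :: Num a => [a] -> [a] -> a
dot u v = sum (zipWith (*) u v)

-- Exact line search along direction d from point x, valid for a
-- quadratic objective: step length t = -(g . d) / (d . H d).
exactLineStep :: [Double] -> [Double] -> [Double]
exactLineStep x d =
  let g  = grad f x
      h  = hessian f x          -- rows of the Hessian
      hd = map (`dot` d) h      -- H d
      t  = negate (dot g d) / dot d hd
  in zipWith (\xi di -> xi + t * di) x d

main :: IO ()
main = print (exactLineStep [0,0] (map negate (grad f [0,0])))
```

From [0,0] along the steepest-descent direction, one such step must strictly reduce f; a CG implementation would then continue with conjugate directions.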

echatav commented 9 years ago

Seriously, this gradientDescent code was not intended for serious use

I didn't realize that :-/ I've been trying to make serious use of it.

echatav commented 9 years ago

@barak : Do you know what the best way to calculate argmins of convex objective functions in Haskell is?

barak commented 9 years ago

Sorry. I blame ... not sure but I'll think of someone.

As to a better routine, the present conjugate gradient routine looks reasonable for a high dimensional system, although the line search could use a bit of robustness-enhancing love. And it works well here, moves rapidly to a point close to the optimum:

Prelude Numeric.AD> take 4 $ conjugateGradientDescent f [0,0]
[[0.0,0.0],[4.0285774585177165,4.7048094968615557e-4],[2.910697209479748,7196.113511627576],[2.911504913045699,7196.645421944485]]

For something low dimensional like the test problem you give, a simple Newton's method would also make sense. Or for robustness one with trust regions, or for something of intermediate dimensionality, BFGS or Levenberg-Marquardt. There is a curated collection of optimization routines in http://lmfit.github.io/lmfit-py/ and I think it would make a lot of sense to translate them from Python to Haskell, using the ad library for the necessary gradients of course.
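To make the Newton's-method suggestion concrete, here is a hedged sketch of one multivariate Newton step built from the library's grad and hessian, with the 2x2 solve done by Cramer's rule (`newtonStep` is my own name, not part of ad). Since f is a strictly convex quadratic, a single step from any starting point should land on the minimizer, up to floating-point error:

```haskell
import Numeric.AD (grad, hessian)

-- The objective from the issue, kept polymorphic so grad and hessian
-- can instantiate it at their own AD number types.
f :: Fractional a => [a] -> a
f [m,b] = (6020.272727*m + b - 23680.30303)^(2::Int)
        + (7254.196429*m + b - 28807.61607)^(2::Int)
        + (6738.575342*m + b - 26582.76712)^(2::Int)
        + (5464.658537*m + b - 23894.34756)^(2::Int)
f _ = error "f expects [m,b]"

-- One Newton step x' = x - H^-1 g, specialized to two dimensions so
-- the linear solve is just Cramer's rule on a 2x2 system.
newtonStep :: [Double] -> [Double]
newtonStep [m,b] =
  let [g1,g2]               = grad f [m,b]
      [[h11,h12],[h21,h22]] = hessian f [m,b]
      det = h11*h22 - h12*h21
      d1  = (g1*h22 - g2*h12) / det
      d2  = (h11*g2 - h21*g1) / det
  in [m - d1, b - d2]
newtonStep x = x

main :: IO ()
main = print (newtonStep [0,0])  -- should land near [2.9115, 7196.5]
```

For non-quadratic objectives this single step is only a local model, which is where trust regions or a damped step come in.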

echatav commented 9 years ago

I've found the optimization library, which seems designed to fit with ad and linear.

TomMD commented 8 years ago

On this topic, what properties hold for conjugate gradient descent? I'm noticing the progression is actually ascending, not descending. This is for a parabolic function with a log barrier, so I'm guessing there is an issue with infinity.

:{
let obj2 [x] = x^2 +
              let bar = (-1/1000) * log (- (5-x) )
              in if isNaN bar then 1/0 else bar
              -- the barrier keeps the minimizer just above x = 5
              -- obj2 is Infinity for x <= 5
:}
conjugateGradientDescent obj2 [20]

~~> [[20.0],[0.0]]

Notice the lack of descent:

map obj2 it
[399.9972919497989,Infinity]
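For what it's worth, the Infinity can be reproduced on plain Doubles with no AD involved: everything at or left of the asymptote collapses to +Infinity, a flat region that carries no slope information for the optimizer. A standalone transcription of obj2 (my own translation, taking a single Double instead of a one-element list):

```haskell
-- obj2 from the comment above: x^2 plus a log barrier that is only
-- defined (and finite) for x > 5.
obj2 :: Double -> Double
obj2 x =
  let bar = (-1/1000) * log (negate (5 - x))
  in x*x + (if isNaN bar then 1/0 else bar)

main :: IO ()
main = do
  print (obj2 20)  -- finite (about 399.997): inside the barrier's domain
  print (obj2 0)   -- Infinity: log of a negative number is NaN, mapped to 1/0
```

So once an iterate crosses x = 5, the objective (and hence any gradient of it) degenerates to Infinity/NaN, and the line search has nothing to work with.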

TomMD commented 8 years ago

By way of an update (and a hijacking of this issue): the objective function above, obj2, retains no information about the gradient past the asymptote. Changing the numerator of 1/0 to what is commonly named µ ((5-x)/0 in this case) yields a differentiable value (via Kahn.grad), but in general the objective function must be twice differentiable for conjugateGradient.

Perhaps I should send a pull request indicating

  1. gradientDescent is not considered 'industrial strength'.
  2. conjugateGradient* objective functions must be twice differentiable.

Thoughts?