SMART-Lab / smartlearner

SMART-Learner is a machine learning library built with researchers in mind.

Define the Optimizer.py #7

Open · MarcCote opened this issue 9 years ago

MarcCote commented 9 years ago

Definitely, the notion of an optimizer is somewhat fuzzy, and so is the Optimizer class.

We should clarify the definitions we are going to use in the library.

Definitions (to be added in a wiki)

Some questions

Suggestions

ASalvail commented 9 years ago

Some questions

MarcCote commented 9 years ago

@ASalvail I moved some of your comments into the original post (because it seems I can do that!).

I'm not familiar with SAG. Knowing the example is not enough; you need its id because you keep a history of the past gradients for each example. Is that it?

I think the term UpdateRule is unclear and refers to many different parts of the optimization process. This is probably why we have a hard time drawing the line between UpdateRule and Optimizer. A couple of months ago @mgermain proposed the terms DirectionModifier and ParamModifier. In my view, we should be able to combine multiple DirectionModifiers, and I don't see how to do that with ADAGRAD, Adam, and Adadelta.

For instance, here are some reusable and combinable DirectionModifiers:

So, what I have in mind for the core of a first-order optimizer (e.g. SGD) is something that looks like this (a rough code sketch follows the list):

  1. Get an initial descent direction (usually the gradient)
  2. Apply some DirectionModifiers
  3. Update parameters
  4. Apply some ParamModifiers
  5. Rinse and repeat
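
To make the discussion concrete, here is a minimal sketch of that loop in plain Python/NumPy. The `DirectionModifier`/`ParamModifier` interfaces and the `ScaleByLearningRate` example are assumptions for illustration, not smartlearner's actual API:

```python
class DirectionModifier:
    """Transforms the descent direction before the parameter update."""
    def apply(self, direction):
        raise NotImplementedError


class ParamModifier:
    """Transforms the parameters after the update (e.g. constraints)."""
    def apply(self, params):
        raise NotImplementedError


class ScaleByLearningRate(DirectionModifier):
    """A reusable modifier: scale the direction by a fixed learning rate."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def apply(self, direction):
        return self.lr * direction


def sgd_step(params, grad_fn, direction_modifiers, param_modifiers):
    # 1. Get an initial descent direction (usually the negative gradient).
    direction = -grad_fn(params)

    # 2. Apply some DirectionModifiers (learning rate, momentum, ...).
    for modifier in direction_modifiers:
        direction = modifier.apply(direction)

    # 3. Update the parameters.
    params = params + direction

    # 4. Apply some ParamModifiers (e.g. a max-norm constraint).
    for modifier in param_modifiers:
        params = modifier.apply(params)

    # 5. Rinse and repeat (from the training loop).
    return params
```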

I can see ADAGRAD, Adam, and Adadelta being called optimizers. They would inherit from SGD (or maybe a new FirstOrderOptimizer class) and use a custom DirectionModifier class (that may or may not be reusable).

So users would only have to specify --optimizer ADAGRAD to use it. In addition, users who want to do something funky could still specify the ADAGRAD optimizer and provide additional DirectionModifiers (see the sketch below).
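
A rough sketch of what ADAGRAD could look like in that scheme, reusing the hypothetical `DirectionModifier` base class from the sketch above (class and method names are assumptions, not the library's API):

```python
import numpy as np


class AdagradDirectionModifier(DirectionModifier):
    """ADAGRAD as a DirectionModifier: rescale by accumulated squared gradients.

    Assumes it receives the raw gradient direction, i.e. it is applied
    before any learning-rate scaling done by another modifier.
    """
    def __init__(self, lr=0.01, eps=1e-6):
        self.lr = lr
        self.eps = eps
        self.sum_sq = None  # running sum of squared gradients

    def apply(self, direction):
        if self.sum_sq is None:
            self.sum_sq = np.zeros_like(direction)
        self.sum_sq += direction ** 2
        return self.lr * direction / np.sqrt(self.sum_sq + self.eps)


# --optimizer ADAGRAD would then amount to something like:
#     sgd_step(params, grad_fn, [AdagradDirectionModifier()], [])
# and a "funky" user could append extra modifiers to that list.
```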

What do you think?

ASalvail commented 9 years ago

@MarcCote That's exactly how SAG proceeds: it stores the gradients of all examples in order to get its gradient-average computation right.
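
For reference, a hypothetical NumPy sketch of that bookkeeping (the function and argument names are illustrative assumptions, not SAG's full algorithm or smartlearner code):

```python
import numpy as np


def sag_step(params, grad_fn, example_id, memory, grad_sum, lr=0.01):
    """One SAG update; memory[i] holds the last gradient computed for example i."""
    new_grad = grad_fn(params, example_id)
    # Swap this example's stored gradient inside the running sum.
    grad_sum += new_grad - memory[example_id]
    memory[example_id] = new_grad
    # Step along the average of the stored per-example gradients.
    n_examples = len(memory)
    return params - lr * grad_sum / n_examples
```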

Those modifiers could be useful as building blocks for an optimizer, but I don't think it'd be useful to use them outside of one. If you want a new fancy optimizer, subclass it.