alashworth / test-issue-import


Generalize auto-tuning in ADVI to grid search function, for auto-tuning in L-BFGS #90

Open alashworth opened 5 years ago

alashworth commented 5 years ago

Issue by dustinvtran Sunday Feb 28, 2016 at 23:09 GMT Originally opened as https://github.com/stan-dev/stan/issues/1780


(There are a number of things I plan to generalize in the code base as I start working on the GMO implementation in Stan. I'm documenting the issues here so I don't forget.)

See title. We can use the same grid search to optimize hyperparameters for many things in Stan. The easiest extension at the moment is to make the grid search a stand-alone function, for auto-tuning the initial step-size parameter in L-BFGS.
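A stand-alone version could take the candidate grid and a scoring callback, so the same routine serves both ADVI (scoring candidates by ELBO gain) and the initial step size in L-BFGS (scoring by objective decrease). A minimal hypothetical sketch; the names are illustrative, not Stan's actual API:

```python
# Hypothetical stand-alone grid search; `grid_search` and `score` are
# illustrative names, not part of Stan's code base.
def grid_search(candidates, score):
    """Return the candidate value with the best trial-run score."""
    return max(candidates, key=score)

# Usage: a stand-in score that prefers a step size near 1.
best = grid_search([100, 10, 1, 0.1, 0.01],
                   score=lambda eta: -abs(eta - 1.0))
```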

alashworth commented 5 years ago

Comment by avehtari Monday Feb 29, 2016 at 12:22 GMT


Is this grid search autotuning described somewhere? Maybe Bayesian optimisation could be used to improve autotuning?

alashworth commented 5 years ago

Comment by dustinvtran Thursday Mar 03, 2016 at 21:03 GMT


Just checked CmdStan and the Stan manual. It's mentioned briefly in both but not in much detail. The idea is very simple: there is a grid of possible values for the step-size hyperparameter; try each one for a fixed number of iterations; choose whichever leads to the fastest increase in the ELBO after that fixed number of iterations. Instead of trying all of them and deciding after the fact, it goes through the step-size choices monotonically, so it can tell early whether to stop checking the remaining grid values and run with a particular setting.
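The adaptation just described can be sketched roughly as follows; `elbo_gain` is a stand-in for a fixed-iteration trial run, not Stan's actual interface:

```python
import math

# Hypothetical sketch of the step-size adaptation described above.
# In practice `elbo_gain` would run the optimizer for a fixed number of
# iterations and report the ELBO improvement; here it is a stand-in.
def adapt_step_size(candidates, elbo_gain):
    """Walk candidates from largest to smallest; once the trial gain stops
    improving, stop early and keep the best step size seen so far."""
    best_eta, best_gain = None, -math.inf
    for eta in candidates:        # assumed sorted in decreasing order
        gain = elbo_gain(eta)     # fixed-iteration trial run
        if gain <= best_gain:
            break                 # monotone traversal lets us stop early
        best_eta, best_gain = eta, gain
    return best_eta

# Stand-in gain peaked at eta = 1; the search stops after trying 0.1
# rather than also evaluating 0.01.
eta = adapt_step_size([100, 10, 1, 0.1, 0.01],
                      elbo_gain=lambda e: -math.log10(e) ** 2)
```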

re: your second question, absolutely! That's the end goal of refactoring all these heuristics for hyperparameter selection: so that we can employ meta-algorithms strictly for that task.

alashworth commented 5 years ago

Comment by bob-carpenter Thursday Mar 03, 2016 at 21:36 GMT


Are these problems monotonic? And how can it set step sizes while it's still out in the tail? The struggle we have with NUTS is to run long enough to reach the typical set, where the real adaptation can kick in. The danger is that if you only set step sizes in the tail regions, which can be very flat compared to the typical set, you end up with overly large step sizes if they are set too early.

Sorry if this is all obvious!


alashworth commented 5 years ago

Comment by dustinvtran Thursday Mar 03, 2016 at 21:57 GMT


Stochastic optimization faces the same problem, but to a lesser extent. Recall that the learning rate multiplies the gradient and so determines the magnitude of the parameter update along that gradient's direction. The learning rate is a step size that decreases over time (iteration count), which guarantees convergence to an optimum, numerical issues aside.
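A toy illustration of why the decaying step size makes the constant scale factor forgiving, sketched here as plain gradient descent on a quadratic (the objective, the 1/t schedule, and the constant `a` are all illustrative assumptions):

```python
# Illustrative sketch, not Stan code: gradient descent on f(x) = (x - 3)^2
# with a learning rate eta_t = a / t that shrinks to zero over iterations.
# Convergence holds for a range of constant scales a, which is why a rough
# initial grid search over the scale is good enough in practice.
def sgd(grad, x0, a=0.4, iters=2000):
    x = x0
    for t in range(1, iters + 1):
        x -= (a / t) * grad(x)   # step size decays as 1/t
    return x

# Starting far out "in the tail" at x = 10, the iterates settle near the
# optimum x = 3 despite the crude choice of a.
x_hat = sgd(lambda x: 2.0 * (x - 3.0), x0=10.0)
```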

Practically, the algorithm still needs a good specification of the learning rate while it's off in the tail regions, but only there. Once it's in the rough vicinity of the posterior's typical set, it will still bounce around a lot, proportionally to the magnitude of the learning rate, but that magnitude eventually decreases to 0. Therefore a misspecification of the constant scale factor won't be too big an issue. Ideally, we would hyperoptimize the gradient descent procedure so that we learn the optimal magnitude at which to update the parameters at each iteration. This is impractical, of course, because it requires running a mini-optimization per iteration, so for now we settled for a single mini-optimization at the beginning. A related notion is the probabilistic line searches of Maren Mahsereci and Philipp Hennig.

We've thought about related ideas for assessing convergence, and for getting to the optimum more quickly without waiting for all that bouncing around to reach zero variance. It's still in the preliminary stages of research, though.

alashworth commented 5 years ago

Comment by syclik Friday May 27, 2016 at 17:16 GMT


@dustinvtran, is there a path forward with this issue? If not, could we close it for now?