dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Adaptive hyper-parameter optimization #161

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

Current options for hyper-parameter optimization (grid search, random search) construct a task graph to search many options, submit that graph, and then wait for the result. However, we might instead submit only enough options to saturate our available workers, get some of those results back, and then, based on those results, submit more work asynchronously. This might explore the parameter space more efficiently.

This would deviate from vanilla task graphs and would probably require the futures API (and the as_completed function specifically), which makes the dask.distributed scheduler a hard requirement.

Here is a simplified and naive implementation to show the use of dask.distributed.as_completed:

from dask.distributed import Client, as_completed

client = Client()  # connect to a dask.distributed cluster

initial_guesses = [...]
futures = [client.submit(fit_and_score, params) for params in initial_guesses]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    if good_enough():
        break
    # otherwise choose a new candidate based on the results seen so far
    params = compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)

I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?

Additionally, I suspect that this problem is further complicated by pipelines. The hierarchical / tree-like structure of pipeline/gridsearch means that it probably makes sense to alter the parameters of the later stages of the pipeline more often than the earlier stages. My simple code example above probably doesn't do the full problem justice.

cc @ogrisel @stsievert @GaelVaroquaux

rth commented 6 years ago

I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?

Computing the next guess could be done e.g. with scikit-optimize; see in particular the Bayesian optimization approach in gp_minimize. Also see MOE, though it's a heavier dependency and its Python packaging doesn't seem to be ideal yet.
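
For illustration, a minimal sequential loop with scikit-optimize's ask/tell interface (skopt.Optimizer wraps the same machinery as gp_minimize) might look like the sketch below; the search space and the fit_and_score helper are placeholders, not part of scikit-optimize.

# Minimal ask/tell sketch with scikit-optimize; `search_space` and
# `fit_and_score` (train a model, return a validation loss) are placeholders.
from skopt import Optimizer
from skopt.space import Integer, Real

search_space = [Real(1e-6, 1e-1, prior="log-uniform", name="learning_rate"),
                Integer(2, 10, name="max_depth")]
opt = Optimizer(search_space, base_estimator="GP")

results = []
for _ in range(25):
    params = opt.ask()              # next guess from the surrogate model
    loss = fit_and_score(params)    # placeholder: train and score a model
    opt.tell(params, loss)          # update the surrogate with the observation
    results.append((loss, params))

best_loss, best_params = min(results, key=lambda r: r[0])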

There are also the Tree-structured Parzen Estimator (TPE) approaches. For instance, this NeuPy blog post gives a general overview of the topic (it also includes a good list of references at the end).

I don't know whether any of this is robust enough to consistently outperform random hyper-parameter search for the algorithms used in dask-ml though.

ogrisel commented 6 years ago

Here is another good candidate algorithm for the sequential learning part: http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html

Most (all?) of those sequential learning algorithms will yield only one suggestion at a time though. They have to be recast into: given the current state, give me the next n_workers candidates to explore approximately in parallel. It's probably doable to hack some tabu-search-like heuristic such as:

As soon as we get the true result of any evaluation, let's update state s_t into s_(t+1) and use that to generate the subsequent evaluation candidate.
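
A rough, non-authoritative sketch of that heuristic, reusing the hypothetical fit_and_score and search_space from the earlier comments with a scikit-optimize surrogate: Optimizer.ask(n_points=...) proposes an initial batch of points (via a "constant liar" strategy), and each completed result is fed back before the next single candidate is requested.

# Keep every worker busy, update the optimizer state as each result arrives,
# then submit the next candidate. `fit_and_score` and `search_space` are
# placeholders from the earlier comments.
from dask.distributed import Client, as_completed
from skopt import Optimizer

client = Client()
opt = Optimizer(search_space, base_estimator="GP")

n_workers = len(client.ncores())
futures = {client.submit(fit_and_score, p): p for p in opt.ask(n_points=n_workers)}

seq = as_completed(list(futures))
n_done = 0
for future in seq:
    params = futures.pop(future)
    opt.tell(params, future.result())   # update state s_t -> s_(t+1)
    n_done += 1
    if n_done >= 50:                    # hypothetical evaluation budget
        break
    next_params = opt.ask()             # generate the subsequent candidate
    new_future = client.submit(fit_and_score, next_params)
    futures[new_future] = next_params
    seq.add(new_future)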

TomAugspurger commented 6 years ago

Most (all?) of those sequential learning algorithms will yield only one suggestion at a time though.

https://arxiv.org/pdf/1206.2944.pdf gives one strategy for choosing the next point given N finished optimizations and J pending ones (section 3.2)

Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations.
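
To make that concrete, here is a rough illustration of the fantasization idea (my sketch, not the paper's code), using scikit-learn's GaussianProcessRegressor; finished_x / finished_y are the N completed evaluations, pending_x the J in-flight points, and candidates the locations being scored.

# Illustrative sketch: score candidates by averaging an acquisition function
# over GP "fantasies" of the pending evaluations. `finished_x`, `finished_y`,
# `pending_x`, and `candidates` are hypothetical numpy arrays.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X, y_best):
    # expected improvement for minimization
    mu, sigma = gp.predict(X, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def mc_acquisition(finished_x, finished_y, pending_x, candidates, n_fantasies=10):
    rng = np.random.RandomState(0)
    gp = GaussianProcessRegressor(normalize_y=True).fit(finished_x, finished_y)
    acq = np.zeros(len(candidates))
    for _ in range(n_fantasies):
        # draw a plausible outcome for each pending point from the GP posterior
        fantasy_y = gp.sample_y(pending_x, random_state=rng).ravel()
        gp_fantasy = GaussianProcessRegressor(normalize_y=True).fit(
            np.vstack([finished_x, pending_x]),
            np.concatenate([finished_y, fantasy_y]),
        )
        acq += expected_improvement(gp_fantasy, candidates, np.min(finished_y))
    return acq / n_fantasies   # submit the candidate maximizing this next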

ogrisel commented 6 years ago

Interesting. My variant is probably going to explore more than what is proposed in "Practical Bayesian Optimization of Machine Learning Algorithms".

stsievert commented 6 years ago

I think this is a critical issue. Earlier this year, I spent many hours tuning hyperparams and didn't have the compute resources available to waste time on poor-performing models. Eventually, I had to reduce my parameter space to a very simple one (which cost me at least two months of research effort).

to evaluate stopping criteria and compute optimal next guesses.

Hyperband treats hyperparam optimization as a resource allocation problem, not one of fitting and optimizing some function. The authors try to speed up the random evaluation of hyperparams rather than perform any sort of optimization, and they do this by devoting compute time to well-performing models.

This is possible by taking advantage of the iterative training process of ML models. At first, Hyperband trains many models for a few iterations. It then stops or "kills" some fraction of the worst-performing ones and repeats. i.e., part of Hyperband trains 100 models for 1 iteration, then the best 50 models for 2 iterations, then the best 25 models for 4 iterations, and so on (see the sketch below). The theory backing this reformulation (the multi-armed bandit problem) is natural and well-studied, but it takes some effort to apply here.
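
A bare-bones sketch of that successive-halving loop (my illustration, not Hyperband itself nor dask-ml code); it assumes estimators that support partial_fit and a hypothetical held-out validation set.

# Bare-bones successive halving as described above; `models` is a list of
# estimators supporting partial_fit, and (X, y) / (X_val, y_val) are
# hypothetical training and validation data.
import numpy as np

def successive_halving(models, X, y, X_val, y_val, n_iter=1):
    while len(models) > 1:
        for model in models:
            for _ in range(n_iter):
                model.partial_fit(X, y)               # train each survivor a bit more
        scores = [model.score(X_val, y_val) for model in models]
        keep = np.argsort(scores)[len(models) // 2:]  # keep the better-scoring half
        models = [models[i] for i in keep]
        n_iter *= 2                                   # double the budget each round
    return models[0]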

They show it's competitive with other Bayesian methods (Spearmint, TPE, SMAC) as well as with random search, and even with random search given twice the computational budget, in figure 8 (which I think is their most realistic example):

[screenshot: figure 8 from the Hyperband paper]

I think taking a computational budget as input, rather than a number of configs to evaluate, makes it even more attractive. I'm planning on working to verify any gains, then implementing this algorithm in Dask. I'll be on this soon: I'll be more free after May 4th, and certainly after the semester ends.

stsievert commented 6 years ago

A while back @mrocklin mentioned that building a framework for adaptive hyperparam selection could allow easier integration into dask-searchcv. I think it'd look something like

from sklearn.base import clone

alg = AdaptiveHyperParameterSearch()  # hypothetical adaptive search algorithm

models, losses = None, None
while True:
    # ask the algorithm which configurations to evaluate next;
    # `model` is the base estimator being tuned
    models = [clone(model).set_params(**config)
              for config in alg.get_configs(models, losses)]
    losses = [validation_loss(model) for model in models]
    # let the algorithm discard poorly performing models
    models = alg.filter(models, losses)
    if alg.stop:
        break

best_model, best_config = alg.best_model, alg.best_config

This is the framework required for adaptive algorithms. It allows for both getting a query with get_configs and processing the answer with filter.

ogrisel commented 6 years ago

But is alg.get_configs a blocking call that waits for all the currently scheduled tasks to complete before returning (therefore implementing a generation-based loop)? I think I'd like the main loop to use the raw dask.distributed API explicitly rather than wrapping it in the class:

from dask.distributed import Client, as_completed

client = Client()
n_workers = len(client.ncores())  # one initial guess per worker

problem_definition = ...
optimizer = MyOptimizer(problem_definition)

# submit just enough initial guesses to keep the workers busy
futures = [client.submit(fit_and_score, params)
           for params in optimizer.initial_guesses(n_workers)]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    optimizer.add_result(result)
    if optimizer.has_converged():
        break
    # feed the new result back and submit the next candidate
    params = optimizer.compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)

jimmywan commented 5 years ago

I've been following this off/on for a while and very excited to see this become available. Any update on when the work might be complete?

TomAugspurger commented 5 years ago

I believe Scott is planning to finish it off in a couple weeks.

stsievert commented 5 years ago

I believe Scott is planning to finish it off in a couple weeks.

Yup, that's the plan with #221.

jimmywan commented 5 years ago

Saw that #221 just got merged. Woohoo! Can't wait to use this!

TomAugspurger commented 4 years ago

https://distill.pub/2020/bayesian-optimization has some good background reading on Bayesian optimization.

stsievert commented 4 years ago

https://distill.pub/2020/bayesian-optimization has some good background

Thanks! I like the simple explanation.

Hyperband and Bayesian optimization work well together, and the combination has seen some previous work (https://ml.informatik.uni-freiburg.de/papers/18-ICML-BOHB.pdf). Hyperband samples parameters randomly and runs the different brackets completely in parallel. Instead, why not refine the parameter space between brackets? The most aggressive bracket is run first to narrow the search space; then a slightly less aggressive bracket is run on a more refined search space, and so on.

Here's their algorithm:

# bracket 0 has no early stopping; bracket 5 has the most early stopping
from dask_ml.model_selection import SuccessiveHalvingSearchCV

brackets = reversed(range(6))
params = {...}  # user-defined search space
for bracket in brackets:
    # each bracket configures how aggressively models are stopped early
    search = SuccessiveHalvingSearchCV(model, params)
    search.fit(...)

    # narrow the search space based on this bracket's results
    # (`refine` is a placeholder for the Bayesian update)
    params = refine(params, search)

# return the best performing model from the final bracket

The implementation backing this is at https://automl.github.io/HpBandSter/build/html/optimizers/bohb.html.