mrocklin opened this issue 6 years ago
I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?
Computing the next guess could be done e.g. with scikit-optimize; see in particular the Bayesian optimization approach in gp_minimize. Also see MOE, though it's a heavier dependency and its Python packaging doesn't seem to be ideal yet.
There are also the Tree-structured Parzen Estimator (TPE) approaches. For instance, this NeuPy blog post gives a general overview of the topic (it also includes a good list of references at the end).
I don't know whether any of this is robust enough to consistently outperform random hyper-parameter search for the algorithms used in dask-ml though.
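For concreteness, here is roughly what a gp_minimize call looks like; the objective and search space below are toy stand-ins for fitting a model and returning a validation loss, not anything from dask-ml:

# Minimal scikit-optimize sketch (assumes skopt is installed).
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    (alpha,) = params
    # ... fit a model with this alpha and return the validation loss ...
    return (alpha - 0.01) ** 2  # dummy loss for illustration

result = gp_minimize(
    objective,
    dimensions=[Real(1e-6, 1e-1, prior="log-uniform", name="alpha")],
    n_calls=20,       # number of sequential evaluations
    random_state=0,
)
print(result.x, result.fun)  # best parameters and best loss found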
Here is another good candidate algorithm for the sequential learning part: http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html
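That optimizer is exposed in dlib's Python bindings as find_min_global; a toy sketch (the objective here is made up, and assumes the dlib package is installed):

# dlib's global optimizer (MaxLIPO + trust region) on a toy 2-D objective.
import dlib

def toy_loss(x0, x1):
    return (x0 - 0.3) ** 2 + (x1 + 0.2) ** 2

best_x, best_y = dlib.find_min_global(
    toy_loss,
    [-1.0, -1.0],   # lower bounds for x0, x1
    [1.0, 1.0],     # upper bounds for x0, x1
    80,             # number of function evaluations
)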
Most (all?) of those sequential learning algorithms will yield one next suggestion at a time though. They have to be tweaked to be cast into: given the current state, give me the next n_workers guesses to explore approximately in parallel. It's probably doable to hack some tabu search-like heuristics such as: given the current state s_t, compute the next best guess g_(t, 0) = compute_next_guess(s_t) and schedule eval(g_(t, 0)); then, without waiting for eval(g_(t, 0)) to happen, pretend that it has returned some average/median value e_(t, 0) and ask for the next "second best guess" g_(t, 1) = compute_next_guess(s_t ∪ e_(t, 0)), scheduling eval(g_(t, 1)) to run in parallel with eval(g_(t, 0)).
As soon as we get the true result of any evaluation, update the state s_t into s_(t+1) and use that to generate the subsequent evaluation candidate.
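A rough sketch of that heuristic, with a hypothetical optimizer interface (compute_next_guess, median_observed_score, and the state object are illustrative names, not an existing API):

def propose_batch(optimizer, state, n_workers):
    """Propose n_workers guesses to evaluate in parallel by temporarily
    filling in a fake (median) result for each still-pending guess."""
    guesses = []
    for _ in range(n_workers):
        guess = optimizer.compute_next_guess(state)
        fake_result = optimizer.median_observed_score()   # pretend result e_(t, i)
        state = state.with_result(guess, fake_result)     # s_t union e_(t, i)
        guesses.append(guess)
    return guesses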
Most (all?) of those sequential learning algorithms will yield one next suggestion at a time though.
https://arxiv.org/pdf/1206.2944.pdf gives one strategy for choosing the next point given N finished optimizations and J pending ones (section 3.2):
Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations.
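A loose sketch of that idea (the GP interface here is mostly hypothetical, in particular augment; only sample_y mirrors scikit-learn's GaussianProcessRegressor):

# Monte Carlo estimate of an acquisition function under fantasized outcomes
# for the pending evaluations, in the spirit of section 3.2 of the paper.
import numpy as np

def mc_acquisition(gp, acquisition, candidate, pending, n_samples=10):
    values = []
    for _ in range(n_samples):
        fantasies = gp.sample_y(pending)             # draw possible pending results
        gp_fantasy = gp.augment(pending, fantasies)  # hypothetical: condition on them
        values.append(acquisition(gp_fantasy, candidate))
    return np.mean(values)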
Interesting. My variant is probably going to explore more than what is proposed in "Practical Bayesian Optimization of Machine Learning Algorithms".
I think this is a critical issue. Earlier this year, I spent many hours tuning hyperparams, and didn't have the compute resources available to waste time on poorly performing models. Eventually, I had to reduce my parameter space to a very simple one (which cost me at least two months of research effort).
to evaluate stopping criteria and compute optimal next guesses.
Hyperband treats hyperparam optimization as a resource allocation problem, not one of fitting + optimizing some function. They try to speed up random evaluation of hyperparams and do not try to perform any sort of optimization. They do this by devoting compute time to well-performing models.
This is possible by taking advantage of the iterative training process for ML models. At first, Hyperband trains many models for a few iterations. It then stops or "kills" some fraction of the worst performers and repeats. For example, one bracket of Hyperband trains 100 models for 1 iteration, then the best 50 models for 2 iterations, then the best 25 models for 4 iterations, and so on. The theory backing this reformulation (the multi-armed bandit problem) is natural and well-studied (but takes some effort to apply here).
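A toy sketch of that inner loop (train and score are stand-ins for partial fitting and validation scoring; this only illustrates the 100 → 50 → 25 schedule, not Hyperband's full bracket logic):

import math

def successive_halving(models, train, score, n_initial_iter=1):
    """Train every model a little, keep the best half, double the training
    budget, and repeat until one model remains."""
    n_iter = n_initial_iter
    while len(models) > 1:
        scores = {}
        for key, model in models.items():
            train(model, n_iter)             # e.g. a few calls to partial_fit
            scores[key] = score(model)       # validation score
        n_keep = math.ceil(len(models) / 2)
        survivors = sorted(models, key=scores.get, reverse=True)[:n_keep]
        models = {key: models[key] for key in survivors}
        n_iter *= 2
    return models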
They show it's competitive with other Bayesian methods (Spearmint, TPE, SMAC) as well as random search, and even random search with twice the computational budget, in figure 8 (which I think is their most realistic example).
I think taking a computational budget as input, rather than a number of configs to evaluate, makes it even more attractive. I'm planning on working to verify any gains, then implementing this algorithm in Dask. I'll be on this soon: I'll be more free after May 4th, and certainly after the semester ends.
A while back @mrocklin mentioned that building a framework for adaptive hyperparam selection could allow easier integration into dask-searchcv. I think it'd look something like
from sklearn.base import clone

def adaptive_search(base_model):
    alg = AdaptiveHyperParameterSearch()   # hypothetical adaptive search object
    models, losses = [], None
    while True:
        configs = alg.get_configs(models, losses)
        models = [clone(base_model).set_params(**config) for config in configs]
        losses = [validation_loss(model) for model in models]
        models = alg.filter(models, losses)
        if alg.stop:
            return alg.best_model, alg.best_config
This is the framework required for adaptive algorithms. It allows for both getting a query with get_configs and processing the answer with filter.
But is alg.get_configs a blocking call that waits for all the currently scheduled tasks to complete before returning (therefore implementing a generation-based loop)? I think I'd like to have the main loop use the raw dask.distributed API explicitly rather than wrapping it in the class:
from dask.distributed import Client, as_completed

# MyOptimizer and fit_and_score are placeholders for the search strategy
# and the model evaluation function.
client = Client()                        # or connect to an existing cluster
n_workers = 8                            # number of evaluations to keep in flight
problem_definition = ...                 # search space, data, scoring, etc.
optimizer = MyOptimizer(problem_definition)

# Submit enough initial guesses to saturate the workers.
futures = [client.submit(fit_and_score, params)
           for params in optimizer.initial_guesses(n_workers)]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    optimizer.add_result(result)
    if optimizer.has_converged():
        break
    # Replace each finished evaluation with a new one based on what we know now.
    params = optimizer.compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)
I've been following this off/on for a while and am very excited to see this become available. Any update on when the work might be complete?
I believe Scott is planning to finish it off in a couple weeks.
Yup, that's the plan with #221.
Saw that #221 just got merged. Woohoo! Can't wait to use this!
https://distill.pub/2020/bayesian-optimization has some good background reading on Bayesian optimization.
Thanks! I like the simple explanation.
Hyperband and Bayesian optimization work well together, and the combination has seen some previous work (https://ml.informatik.uni-freiburg.de/papers/18-ICML-BOHB.pdf). Hyperband samples parameters randomly and runs the different brackets completely in parallel. Instead, why not refine the parameter space between brackets? The most aggressive bracket is run first to narrow the search space, then a slightly less aggressive bracket with a more refined search space, and so on.
Here's their algorithm:
# bracket 0 has no early stopping; bracket 5 has the most early stopping
brackets = reversed(range(6))
params = {...}  # user-defined search space

for bracket in brackets:
    search = SuccessiveHalvingSearchCV(model, params)  # configured for this bracket
    search.fit(...)
    params = refine(params, search)

# return best performing model
The implementation backing this is at https://automl.github.io/HpBandSter/build/html/optimizers/bohb.html.
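The refine step above is left undefined. One hypothetical version for purely numeric parameter lists could shrink each range around the best configuration found by the previous bracket (this assumes the fitted search exposes best_params_ and ignores categorical parameters):

import numpy as np

def refine(params, search, shrink=0.5):
    """Narrow each numeric parameter range around the best value found so far."""
    new_params = {}
    for name, values in params.items():
        lo, hi = min(values), max(values)
        center = search.best_params_[name]
        half_width = (hi - lo) * shrink / 2
        new_params[name] = np.linspace(max(lo, center - half_width),
                                       min(hi, center + half_width),
                                       num=len(values))
    return new_params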
Current options for hyper-parameter optimization (grid search, random search) construct a task graph to search many options, submit that graph, and then wait for the result. However, we might instead submit only enough options to saturate our available workers, get some of those results back and then based on those results submit more work asynchronously. This might explore the parameter space more efficiently.
This would deviate from vanilla task graphs and would probably require the futures API (and the as_completed function specifically), which imposes a hard requirement on the dask.distributed scheduler.
Here is a simplified and naive implementation to show the use of dask.distributed.as_completed.
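A sketch of such a loop (fit_and_score, initial_guesses, next_guess, and good_enough are hypothetical helpers standing in for the model evaluation and the search strategy):

from dask.distributed import Client, as_completed

client = Client()  # or connect to an existing cluster

scores = []
futures = [client.submit(fit_and_score, params) for params in initial_guesses()]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    scores.append(result)
    if good_enough(scores):
        break
    # Submit one new evaluation for each one that finishes, keeping the
    # workers saturated without building the whole graph up front.
    new_future = client.submit(fit_and_score, next_guess(scores))
    seq.add(new_future)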
Additionally, I suspect that this problem is further complicated by pipelines. The hierarchical / tree-like structure of pipeline/gridsearch means that it probably makes sense to alter the parameters of the latter stages of the pipeline more often than the earlier stages. My simple code example above probably doesn't do the full problem justice.
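For instance, with a scikit-learn pipeline like the one below (purely illustrative), the final step's parameters are cheap to vary because the earlier transformations can in principle be reused (e.g. with caching), while changing the first step's parameters invalidates everything downstream:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SGDClassifier())])
param_space = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],        # early stage: expensive to change
    "clf__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],  # late stage: cheap to change
}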
cc @ogrisel @stsievert @GaelVaroquaux