dask / dask-searchcv

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml

Asynchronous algorithms and "Good enough" RandomSearchCV #32

Open mrocklin opened 7 years ago

mrocklin commented 7 years ago

It would be interesting to start exploring asynchronous algorithms within this project using the dask.distributed API. Because this API is somewhat different it might be wise to start with something simple.

One simple application would be to build a variant of RandomSearchCV that, instead of taking a fixed number of candidates to try, took a stopping criterion like "have tried 100 options and not improved more than 1%" and then continued submitting computations until that criterion was met.

My initial approach would be to periodically check the number of cores available

client = distributed.client.default_client()
ncores = sum(client.ncores().values())

and try to keep roughly twice that many candidates in flight

candidate_pool = create_infinite_candidates(parameterspace)
futures = client.map(try_and_score, list(toolz.take(ncores * 2, candidate_pool)))

Then I would consume those futures as they finished

af = distributed.as_completed(futures)
for future in af:
    score, params = future.result()
    if score > best:
        best = score
        best_params = params
        ...

and then submit new futures as necessary

    future = client.submit(try_and_score, next(candidate_pool))
    af.add(future)

If we wanted to be cool, we could also check the number of cores periodically and submit more or less accordingly.
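Putting those pieces together, a rough end-to-end sketch might look like the following (try_and_score and create_infinite_candidates are hypothetical helpers, and the patience/tolerance numbers are arbitrary):

import distributed
import toolz

def good_enough_random_search(try_and_score, candidate_pool,
                              patience=100, tol=0.01):
    # try_and_score: fits and scores one candidate, returning (score, params)
    # candidate_pool: infinite iterator of parameter settings
    client = distributed.client.default_client()
    ncores = sum(client.ncores().values())

    # keep roughly twice as many candidates in flight as we have cores
    futures = client.map(try_and_score,
                         list(toolz.take(ncores * 2, candidate_pool)))
    af = distributed.as_completed(futures)

    best, best_params, since_improved = None, None, 0
    for future in af:
        score, params = future.result()
        if best is None or score > best + abs(best) * tol:
            best, best_params, since_improved = score, params, 0
        else:
            since_improved += 1
        if since_improved >= patience:
            break  # "good enough": patience trials without a tol improvement

        # replace the finished future, topping up if the cluster grew
        new_ncores = sum(client.ncores().values())
        for _ in range(1 + max(0, (new_ncores - ncores) * 2)):
            af.add(client.submit(try_and_score, next(candidate_pool)))
        ncores = new_ncores

    return best, best_params

Any futures still in flight at the break are simply abandoned here; a real implementation would probably cancel them.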

cc @jcrist @amueller

amueller commented 7 years ago

Which stopping criterion to use is independent of distributing it, right? This sounds like a nice simple example.
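For instance, the criterion could be a small stateful object that knows nothing about distribution (a hypothetical sketch, just to illustrate the separation):

class NoImprovement:
    # Stop after `patience` trials without a relative improvement of `tol`.
    def __init__(self, patience=100, tol=0.01):
        self.patience, self.tol = patience, tol
        self.best = None
        self.since_improved = 0

    def __call__(self, score):
        if self.best is None or score > self.best + abs(self.best) * self.tol:
            self.best, self.since_improved = score, 0
        else:
            self.since_improved += 1
        return self.since_improved >= self.patience

The same object could drive a plain sequential loop or the distributed one sketched above.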

If you're looking for something more complex, the parallelization here would be interesting: https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf

I think everything except the parallelization is implemented in scikit-optimize. You might need some ML help, maybe @mechcoder is interested ;)

eriknw commented 7 years ago

I have begun the dask-patternsearch project, which may be useful here. You can think of it as a line-search algorithm generalized to multiple dimensions without using derivatives. Its distinguishing feature is that you can throw as many cores at it as you want. In principle, optimizing a 100-dimensional problem (that fits in memory) on 100,000 cores should be just fine. The algorithm isn't particularly fancy, smart, or optimal, but I would say it's better than grid search and Nelder-Mead simplex.

Read more about it here: http://www.cs.wm.edu/~va/research/sirev.pdf http://www.cs.wm.edu/~va/research/deto.pdf

Feel free to direct questions to the repo. Again, the project is brand-spanking-new and is missing such nice things as documentation and tests.

Example usage:

import numpy as np
import distributed
from dask_patternsearch import search
client = distributed.Client()

def f(x):
    return x.dot(x)

x0 = np.array([12.3, 4.5])
stepsize = np.array([1, 1])
results = search(client, f, x0, stepsize, stopratio=1e-5)

https://github.com/eriknw/dask-patternsearch

thomasgreg commented 7 years ago

Hi, I've been working on an online/async version of the code here: https://github.com/thomasgreg/dask-searchcv/tree/online-fit.

It ended up involving a refactoring of much of the code in model_selection.py to allow updates to the graph without generating extra tasks (the existing code assumes that the full parameter space is known beforehand).

If this approach looks useful I can clean up the branch and put in a PR (it's a bit messy right now, as I was hesitant to fully replace the existing batch implementation without feedback).

Most tests are passing, apart from a few edge cases I'm still working out (a dask delayed object in fit_params, the graph size without caching, ...). There's an example script at examples/async_randomsearchcv.py.

Best regards, Greg

jcrist commented 7 years ago

Thanks for taking a look at this. I suggest you submit a PR early and we can discuss it more there. At first glance, I'm not sure what your intent is here. What problem are you trying to solve? If the goal is to solve this issue, I think there are simpler solutions that don't involve major changes to the existing code. If the goal is to make the existing classes asynchronous, I also think there are simpler ways to accomplish that. Either way, I still suggest you submit a WIP PR and we can discuss things more there.

thomasgreg commented 7 years ago

Thanks for the quick reply - I've submitted a PR. I understand that a big code change is a headache, and in hindsight I can see how it might be done in a simpler way with minimal changes.

amueller commented 7 years ago

@eriknw you should run a benchmark against spearmint, which is an implementation of the paper I linked to. I think they have a distributed implementation, though that might not be open source.

eriknw commented 7 years ago

Thanks for the suggestion @amueller and letting me know they have a distributed implementation. I plan to perform comparative benchmarks against a couple notable packages and will definitely include spearmint.

I plan to implement adaptive step sizes in dask-patternsearch before putting much effort into benchmarks. Medium term, I would like to allow other optimizers to be easily used in combination with dask-patternsearch. Pattern search optimization readily allows "oracles" to propose trial points, and the method you linked to should be an excellent candidate for an oracle. Longer term, I'd like to see a multi-armed bandit approach perform real-time resource allocation across optimizers (i.e., give more cores to the methods most likely to yield improvement).
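To make the bandit idea concrete, here is a toy sketch of UCB-style core allocation (all names are hypothetical, and it assumes each optimizer's recent improvements are non-negative):

import math

def allocate_cores(optimizers, total_cores, c=2.0):
    # optimizers: hypothetical records with .pulls (evaluations so far)
    # and .mean_improvement (average recent score improvement, >= 0)
    total_pulls = max(sum(o.pulls for o in optimizers), 1)
    ucb = [o.mean_improvement
           + c * math.sqrt(math.log(total_pulls) / max(o.pulls, 1))
           for o in optimizers]
    total = sum(ucb) or len(ucb)
    # every optimizer keeps at least one core; the rest split by UCB weight
    return [max(1, round(total_cores * u / total)) for u in ucb]

Rounding means the counts only approximately sum to total_cores; a real allocator would rebalance.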

I've had a demanding travel schedule this past month, but I'll continue working on dask-patternsearch soon and should have it reasonably functional and presentable in time for the SciPy conference. I expect we can get dask-searchcv to use dask-patternsearch some time this summer.