dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Capture parameters from scikit-optimize for hyperparameter optimization #300

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

The Scikit-optimize BayesSearchCV class likely has internal functions that help it determine new parameters given scores on a set of older parameters. It would be useful either to introduce Dask to scikit-optimize or to reuse this logic and implement our own BayesSearchCV. This could be used with standard fit-style algorithms or with Incremental if we have a nice early-stopping criterion.

This came out of conversation with @ogrisel

mrocklin commented 6 years ago

cc @betatim, who seems to be the leading developer of scikit-optimize

dhirschfeld commented 6 years ago

I would love to see closer integration of distributed and scikit-optimize! :heart:

betatim commented 6 years ago

Most algorithms in scikit-optimize are a bit tricky to parallelise (beyond say four or so instances). Basic reasoning: we look at past results to pick "the best" next point. So if we have to pick the next 16 points now to evaluate in parallel, we will make a worse decision than if we do four batches of four, where we can use the feedback of earlier batches when picking the next batch.
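
For concreteness, a minimal ask/tell sketch of the four-batches-of-four pattern (score_model here is a hypothetical function that fits and scores one parameter set):

from skopt import Optimizer
from skopt.space import Real

# toy 1-D search space; each batch is chosen with the benefit of all earlier results
opt = Optimizer([Real(1e-6, 1e+6, prior='log-uniform')])

for _ in range(4):                             # four sequential batches
    points = opt.ask(n_points=4)               # pick 4 points using all results so far
    scores = [score_model(p) for p in points]  # evaluate (in parallel in practice)
    opt.tell(points, scores)                   # feed results back before the next batch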

What is the starting point to use dask in scikit-optimize? Would we aim to make something like https://github.com/dask/dask-examples/blob/master/machine-learning.ipynb work?

mrocklin commented 6 years ago

So if we have to pick the 16 next points now to evaluate in parallel we will make a worse decision than if we do four batches of four where we can use the feedback of earlier batches when picking the next batch.

Agreed. This seems to be the nature of speculative work. My guess would be that we can still get some value out of exploration, even if it's not optimal. In many cases the extra computation is somewhat free and people are willing to waste a bit.

I don't know this problem space well though, so I may be missing something important.

What is the starting point to use dask in scikit-optimize? Would we aim to make something like https://github.com/dask/dask-examples/blob/master/machine-learning.ipynb work?

I don't know much about the practice of machine learning, but yes, I imagine that replacing GridSearchCV in that example with something a bit more intelligent would be a good start. You might also be able to suggest a better example to parallelize.

betatim commented 6 years ago

I will have a think and see if someone wants to try out what happens if we switch out the backend.

The comment about parallelising was more a caveat emptor, as people often find scikit-optimize and are surprised by the relatively low parallel work factor, i.e. how little there is to gain from doing things in parallel.

betatim commented 6 years ago

@iaroslav-ai @glouppe should we discuss here or create a new issue in scikit-optimize?

glouppe commented 6 years ago

I don't mind having the discussion here.

I am all for better integration with Dask, provided we clearly define what that means / what that requires from us.

mrocklin commented 6 years ago

So I tried things today with joblib and the following example:

from dask.distributed import Client, progress
import distributed.joblib  # needed until sklearn 0.20
client = Client(processes=False)

from skopt import BayesSearchCV
# parameter ranges are specified by one of below
from skopt.space import Real, Categorical, Integer

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.externals.joblib import parallel_backend

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75,
                                                    random_state=0)

# log-uniform: understand as search over p = exp(x) by varying x
opt = BayesSearchCV(
    SVC(),
    {
        'C': Real(1e-6, 1e+6, prior='log-uniform'),
        'gamma': Real(1e-6, 1e+1, prior='log-uniform'),
        'degree': Integer(1,8),
        'kernel': Categorical(['linear', 'poly', 'rbf']),
    },
    n_iter=32,
    n_jobs=1,
)

with parallel_backend('dask'):
    # executes bayesian optimization
    opt.fit(X_train, y_train)

and the dashboard shows tasks like the following:

[screenshot: Dask dashboard task stream]

Each green rectangle is a call to fmin_l_bfgs_batch, which takes around 50ms. We're seeing a lot of dead space in between tasks, likely meaning that skopt is doing some non-trivial work in between sending out batches.

After running this experiment I have some questions:

  1. Is scikit-optimize simple enough that just using joblib is the right way to go? Or are there things that we would want to do with a more expressive interface?
  2. In particular, I might suggest trying to solve this problem with the concurrent.futures interface. You would submit lots of tasks (presumably a bit more than you had cores), then as they completed you would update your next best guess and submit that one to run alongside the others (a rough sketch follows this list). This is maybe similar to what runs today, just with more tasks in flight.
  3. Is there a problem with a larger cost so that we can see how much the overhead hurts in a realistic setting?
  4. What is the cause of the overhead here?

    Actually I profiled a bit. It's spending a bit of time in scipy.optimize, coming from the skopt/optimizer/optimizer.py::Optimizer._tell method.
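
To make point 2 concrete, here is a rough sketch of the futures-based loop I have in mind. It assumes a hypothetical evaluate(point) function that fits and scores one parameter set, and uses skopt's ask/tell Optimizer interface together with dask.distributed's as_completed:

from dask.distributed import Client, as_completed
from skopt import Optimizer
from skopt.space import Real

client = Client(processes=False)
opt = Optimizer([Real(1e-6, 1e+6, prior='log-uniform'),
                 Real(1e-6, 1e+1, prior='log-uniform')])

# keep a few more evaluations in flight than there are cores
params = {}                                  # future -> the point it is evaluating
for point in opt.ask(n_points=8):
    fut = client.submit(evaluate, point)     # `evaluate` is hypothetical
    params[fut] = point

completed = as_completed(params)
n_total = 32
for fut in completed:
    opt.tell(params.pop(fut), fut.result())  # update the model with this result
    if len(opt.Xi) + len(params) < n_total:  # keep the pipeline full
        point = opt.ask()                    # ask for one new point to evaluate
        new_fut = client.submit(evaluate, point)
        params[new_fut] = point
        completed.add(new_fut)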

mrocklin commented 6 years ago

It would help me make suggestions if I knew a little bit more about how the computation in scikit-optimize is organized. It seems today like the process is

  1. Launch a bunch of options, given our best estimates
  2. Get those back
  3. Solve a local optimization problem with those results
  4. go to 1

However watching the dashboard it seems like there might be a couple different nested stages to this, so I probably don't have the full picture.

mrocklin commented 6 years ago

Here is why I suspect that there may be a couple of nested stages:

[screenshot: Dask dashboard task stream showing nested stages]

mrocklin commented 6 years ago

Thought I'd check in here. @betatim or @glouppe do either of you have time to answer the questions above in https://github.com/dask/dask-ml/issues/300#issuecomment-405347553 ?

iaroslav-ai commented 6 years ago

Hi @mrocklin , here are some thoughts:

  1. I would say that simply using joblib + dask is good enough when fitting every model takes an equal amount of time. In my experience this is not always true - for example, if I train an MLPClassifier model, then depending on the number of neurons selected, the computations will take different amounts of time. That would mean that some cores / workers stay idle until all computations in the batch are finished. If the differences in computation time are exponential, this means a lot of idle time.
  2. I do agree that an approach like the one you suggested would work quite well. Essentially, evaluation of the _fit_and_score function, which is currently done here in a batch, would need to be done asynchronously. This could be done by replacing this line with code which starts n_jobs computations in parallel and waits for one of the jobs to finish. As soon as a job finishes, its result should be "told" to the instance of the optimization algorithm given in optimizer, and then ask needs to be called in order to obtain a new point to evaluate. Also, all the variables that store evaluation history in BayesSearchCV should be updated.
  3. I guess some larger dataset used with your code would be more telling - maybe MNIST?
  4. The overhead that you see for _tell is where the estimator which models the cross-validation score w.r.t. different parameters is updated. It can take some time, especially after a larger number of iterations. This overhead will always happen when new points are added to the optimizer.

iaroslav-ai commented 6 years ago

Hmm, there is a bit more to point 2. There are multiple ways to sample multiple points to evaluate in parallel; we currently use a strategy called "constant liar". I will post a relevant paper here in a moment.

iaroslav-ai commented 6 years ago

You can see the strategy here (search for "constant liar" in the document). Essentially, with the classical Bayesian Optimization approach you can get only one point to evaluate without updating the underlying model of the costly objective. What people do is supply a "fake" objective value for the parameter that was suggested by the BO algorithm. This allows a different point to be sampled. The process is repeated until you have a sufficient number of points.
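
A hand-rolled sketch of that idea with skopt's ask/tell interface (Optimizer.ask(n_points=...) does something like this internally), using the current best score as the lie:

from skopt import Optimizer
from skopt.space import Real

opt = Optimizer([Real(1e-6, 1e+6, prior='log-uniform')])

# work on a throw-away copy so the lies never pollute the real optimizer
liar = opt.copy()
lie = min(liar.yi) if liar.yi else 0.0   # the "fake" objective value
batch = []
for _ in range(4):
    x = liar.ask()      # best point according to the copy's current model
    liar.tell(x, lie)   # pretend we already evaluated it, so the next ask differs
    batch.append(x)

# evaluate `batch` for real (in parallel), then opt.tell(batch, real_scores)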

iaroslav-ai commented 6 years ago

This is done here, for example, in the scikit-optimize code. One would need to do something similar in order to get the initial batch of points to evaluate, and in order to generate new points while other evaluations are still running.

iaroslav-ai commented 6 years ago

So to elaborate on your comment, the following is happening:

iaroslav-ai commented 6 years ago

About nested stages: what you might see is that the surrogate of the objective is being minimized multiple times using LBFGS with random starting points, in order to avoid, to a reasonable extent, the local minima problem. The blue rectangles are the calls to the minimization functions, right?
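
For reference, a rough sketch of that multi-restart pattern, assuming a hypothetical acquisition(x) function built from the surrogate model:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def minimize_surrogate(acquisition, bounds, n_restarts=10, rng=np.random):
    """Minimize `acquisition` from several random starts and keep the best result."""
    best_x, best_val = None, np.inf
    for _ in range(n_restarts):
        # random starting point inside the box, to escape local minima
        x0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        x, val, _ = fmin_l_bfgs_b(acquisition, x0, bounds=bounds, approx_grad=True)
        if val < best_val:
            best_x, best_val = x, val
    return best_x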