Open mrocklin opened 6 years ago
cc @betatim, who seems to be the leading developer of scikit-optimize
I would love to see closer integration of distributed
and scikit-optimize
! :heart:
Most algorithms in scikit-optimize
are a bit tricky to parallelise (beyond say four or so instances). Basic reasoning: we look at past results to pick "the best" next point. So if we have to pick the 16 next points now to evaluate in parallel we will make a worse decision than if we do four batches of four where we can use the feedback of earlier batches when picking the next batch.
What is the starting point to use dask in scikit-optimize? Would we aim to make something like https://github.com/dask/dask-examples/blob/master/machine-learning.ipynb work?
So if we have to pick the 16 next points now to evaluate in parallel we will make a worse decision than if we do four batches of four where we can use the feedback of earlier batches when picking the next batch.
Agreed. This seems to be the nature of speculative work. My guess would be that we can still get some value out of exploration, even if it's not optimal. In many cases the extra computation is somewhat free and people are willing to waste a bit.
I don't know this problem space well though, so I may be missing something important.
What is the starting point to use dask in scikit-optimize? Would we aim to make something like https://github.com/dask/dask-examples/blob/master/machine-learning.ipynb work?
I don't know much about the practice of machine learning, but yes, I imagine that replacing GridSearchCV with something that was a bit more intelligent in that problem would be a good start. You might also be able to suggest a better example to parallelize.
I will have a think and see if someone wants to try out what happens if we switch out the backend.
The comment about parallelising was more a caveat emptor as often people find scikit-optimize and are surprised by the relatively low parallel work factor/excitement of doing things in parallel.
@iaroslav-ai @glouppe should we discuss here or create a new issue in scikit-optimize?
I dont mind having the discussion here.
I am all for a better integration with Dask, provided we clearly define what that means / that requires from us.
So I tried things today with joblib and the following example:
from dask.distributed import Client, progress
import distributed.joblib # needed until sklearn 0.20
client = Client(processes=False)
from skopt import BayesSearchCV
# parameter ranges are specified by one of below
from skopt.space import Real, Categorical, Integer
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.externals.joblib import parallel_backend
X, y = load_iris(True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75,
random_state=0)
# log-uniform: understand as search over p = exp(x) by varying x
opt = BayesSearchCV(
SVC(),
{
'C': Real(1e-6, 1e+6, prior='log-uniform'),
'gamma': Real(1e-6, 1e+1, prior='log-uniform'),
'degree': Integer(1,8),
'kernel': Categorical(['linear', 'poly', 'rbf']),
},
n_iter=32,
n_jobs=1,
)
with parallel_backend('dask'):
# executes bayesian optimization
opt.fit(X_train, y_train)
and the dashboard shows tasks like the following:
Each green rectangle is a call to fmin_l_bfgs_batch
, which takes around 50ms. We're seeing a lot of dead space in between tasks, likely meaning that skopt
is doing some non-trivial work in between sending out batches.
After running this experiment I have some questions:
What is the cause of the overhead here?
Actually I profiled a bit. It's spending a bit of time in scipy.optimize, coming from the skopt/optimizer/optimizer.py::Optimizer._tell
method.
It would help me make suggestions if I knew a little bit more about how the computation in scikit-optimize is organized. It seems today like the process is
However watching the dashboard it seems like there might be a couple different nested stages to this, so I probably don't have the full picture.
Here is why I suspect that there may be a couple of nested stages:
Thought I'd check in here. @betatim or @glouppe do either of you have time to answer the questions above in https://github.com/dask/dask-ml/issues/300#issuecomment-405347553 ?
Hi @mrocklin , here are some thoughts:
MLPClassifier
model, then depending on number of neurons selected, the computations will take different amount of time. That would mean that some cores / workers will stay idle, until all computations in batch are finished. If the differences in computations are exponential, this means a lot of idle time._fit_and_score
function, as is done here currently in a batch, would need to be done asynchronously. This could be done by replacing this line with a code which starts n_jobs
computations in parallel, and waits for one of jobs to finish. As soon as the job is finished, the result of computation should be "told" to the instance of optimization algorithm given in optimizer
, and then one needs to call ask
in order to obtain a new point to evaluate. Also all the variables that store evaluation history in BayesSearchCV
should be updated. _tell
is where the estimator which models the cross - validation score w.r.t. different parameters is updated. It might take some time, especially after a larger number of iterations are done. This overhead will always happen as when new points are added to the parameter optimizer algorithm. Hmm there is a bit more to 2. . There are multiple ways to sample multiple points to evaluate in parallel; We currently use a strategy called "constant liar". I will put here a relevant paper in a moment.
You can see the strategy in here (search for "constant liar" in the document). Essentially, with classical Bayesian Optimization approach, you can get only one point to evaluate, without updating the underlying model of the costly objective. What people do is they supply a "fake" objective value for a parameter that was suggested by the BO algorithm. This allows to sample a different point. A process is repeated untill you have sufficient number of points.
This is done here for example in the scikit-optimize code. One would need to do something similar in order to get initial batch of points to evaluate, and in order to generate new points when multiple processes are not finished.
So to elaborate on your comment, the following is happening:
About nested stages: what you might see is that the surrogate of the objective is being minimized multiple times using LBFGS with random starting points, in order to avoid to reasonable extent local minima problem. The blue rectangles are the calls to minimization functions, right?
The Scikit-optimize
BayesSearchCV
class likely has internal functions that help it determine new parameters given scores on a set of older parameters. It would be useful to either introduce Dask to scikit-optimize or else reuse this logic and implement our ownBayesSearchCV
. This could be used with standardfit
style algorithms or withIncremental
if we have a nice early-stopping criterion.This came out of conversation with @ogrisel