Open · mrocklin opened this issue 7 years ago
Testing notebook: https://gist.github.com/58ebc10424acd99d4514003e6d978076
@mrocklin Has gradient_descent been optimized (using delayed, persist, etc.) in the same way that the other functions have? I might be refactoring soon and I wanted to make sure that piece was taken care of first.
I think so
On Fri, Feb 10, 2017 at 10:02 AM, Chris White notifications@github.com asked the question quoted above.
Note that @eriknw is working on a dask optimization that may help to reduce overhead here: https://github.com/dask/dask/pull/1979
I sat down with @amueller and we compared against sklearn's SGD. We found that proximal_grad and sklearn.SGD had similar runtimes on a single machine (using dask.distributed; we didn't try the threaded scheduler). Presumably SGD was being a bit smarter algorithmically while dask-glm was using more hardware.
@mrocklin Did you look at ADMM? I'm starting to think that, going forward, we should employ only ADMM, Newton, and gradient_descent.
Nope, we only spent a few minutes on it. We ran the following:
```python
import dask.array as da
import numpy as np
from dask import persist, compute  # needed for persist()/compute() below
from dask_glm.logistic import *
from dask_glm.utils import *
from distributed import Client

c = Client('localhost:8786')

N = int(1e7)
chunks = int(1e6)
seed = 20009

X = da.random.random((N, 2), chunks=chunks)
y = make_y(X, beta=np.array([-1.5, 3]), chunks=chunks)
X, y = persist(X, y)

%time proximal_grad(X, y)

from sklearn.linear_model import SGDClassifier

nX, ny = compute(X, y)
%time sgd = SGDClassifier(loss='log', n_iter=10, verbose=10, fit_intercept=False).fit(nX, ny)
```
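For reference, the proximal_grad call above comes down to alternating a gradient step on the smooth logistic loss with a soft-thresholding proximal step for the L1 penalty. Here is a minimal NumPy sketch of that idea; the names soft_threshold and prox_grad_step are illustrative, not dask-glm's actual API:

```python
import numpy as np

def soft_threshold(beta, t):
    """Proximal operator of t * ||.||_1: shrink every coefficient toward
    zero by t, and clip anything that crosses zero to exactly zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def prox_grad_step(beta, X, y, step, reg):
    """One proximal-gradient step for L1-regularized logistic regression:
    a plain gradient step on the mean log-loss, then soft-thresholding."""
    p = 1.0 / (1.0 + np.exp(-X.dot(beta)))   # predicted probabilities
    grad = X.T.dot(p - y) / len(y)           # gradient of the mean log-loss
    return soft_threshold(beta - step * grad, step * reg)
```

The soft-thresholding piece is also what the z-update in an L1 ADMM formulation reduces to, which is part of why these solvers share so much machinery.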
I haven't looked into whether we could use this data for benchmarking, but the very large dataset from the Outbrain click-prediction competition (https://www.kaggle.com/c/outbrain-click-prediction/data) seems like a good candidate. We would need to preprocess the data a bit before fitting a model, but I wouldn't mind taking a stab at that piece.
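As a rough illustration of the preprocessing that would involve: the Outbrain data is mostly categorical IDs, so one common option is the hashing trick, which maps IDs into a fixed-width numeric design matrix. The sketch below is a hypothetical pure-NumPy version (hash_features is not an existing function in dask-glm or sklearn; for real data one would more likely reach for sklearn's FeatureHasher, possibly applied per-partition with dask.dataframe):

```python
import numpy as np

def hash_features(rows, n_features=2**10):
    """Feature hashing: map each 'field=value' string to a column index,
    turning dicts of categorical IDs into a dense numeric matrix."""
    X = np.zeros((len(rows), n_features))
    for i, row in enumerate(rows):
        for field, value in row.items():
            j = hash(f"{field}={value}") % n_features
            X[i, j] += 1.0  # collisions simply add, as in FeatureHasher
    return X

# Hypothetical rows loosely shaped like click-log records:
rows = [{"ad_id": "123", "platform": "mobile"},
        {"ad_id": "456", "platform": "desktop"}]
X = hash_features(rows)  # shape (2, 1024), two unit entries per row
```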
cc: @hussainsultan @jcrist
On eight m4.2xlarges I created the following dataset
I then ran the various methods within this project and recorded the profiles as bokeh plots. They are linked to below:
Additionally, I ran against a 10x larger dataset and got the following results
Most runtimes were around a minute. The BFGS solution gave wrong results.
Notes
On larger problems with smallish chunks (8 * 4 * 1e6 == 24 MB) we seem to be bound by scheduling overhead. I've created an isolated benchmark here that is representative of this overhead: https://gist.github.com/mrocklin/48b7c4b610db63b2ee816bd387b5a328