dask / dask-glm


Profile results #26

mrocklin opened this issue 7 years ago

mrocklin commented 7 years ago

On eight m4.2xlarges I created the following dataset:

import numpy as np
import dask.array as da
from dask import persist
from dask_glm.utils import sigmoid

N = int(1e8)        # rows
M = 4               # columns
beta = np.array([-1, 0, 1, 2])
chunks = int(1e6)   # rows per chunk
seed = 20009

X = da.random.random((N, M), chunks=(chunks, M))
z0 = X.dot(beta)
y = da.random.random(z0.shape, chunks=z0.chunks) < sigmoid(z0)

X, y = persist(X, y)

I then ran the various methods within this project and recorded the profiles as bokeh plots, linked below.

Additionally, I ran against a 10x larger dataset and got the following results.

Most runtimes were around a minute. The BFGS solution gave wrong results.
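For anyone reproducing these profiles, a bokeh report can be captured around a single solver call with something like the following (a minimal sketch using distributed's performance_report, which needs a reasonably recent distributed; this is not necessarily how the plots above were generated, and it assumes the X, y built above):

# Sketch only: save a standalone bokeh report (task stream, worker profile, ...)
# for one solver run, using the persisted X, y from the setup above.
from dask.distributed import performance_report
from dask_glm.logistic import gradient_descent

with performance_report(filename="gradient_descent-profile.html"):
    beta_hat = gradient_descent(X, y)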

Notes

On larger problems with smallish chunks (8 * 4 * 1e6 bytes == 32 MB per chunk) we seem to be bound by scheduling overhead. I've created an isolated benchmark here that is representative of this overhead: https://gist.github.com/mrocklin/48b7c4b610db63b2ee816bd387b5a328
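To put rough numbers on that, the chunk size and task counts work out as follows (a back-of-the-envelope sketch; the per-task scheduling cost mentioned in the comments is an assumed typical figure, not a measurement from these runs):

# Chunk size and task count behind the "scheduling overhead" observation.
rows_per_chunk = int(1e6)
cols = 4
bytes_per_chunk = 8 * cols * rows_per_chunk        # float64 -> ~32 MB per chunk

n_rows = int(1e9)                                  # the 10x larger run
n_chunks = n_rows // rows_per_chunk                # 1000 chunks
# Every pass over the data touches each chunk at least once, so an iterative
# solver issues on the order of n_chunks tasks per iteration; at sub-millisecond
# scheduling cost per task, that overhead starts to rival the compute itself.
print(bytes_per_chunk / 1e6, "MB per chunk;", n_chunks, "chunks")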

mrocklin commented 7 years ago

Testing notebook: https://gist.github.com/58ebc10424acd99d4514003e6d978076

cicdw commented 7 years ago

@mrocklin Has gradient_descent been optimized (using delayed, persist, etc.) in the same way that the other functions have? I might be refactoring soon and I wanted to make sure that piece was taken care of first.
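(For reference, a minimal sketch of the pattern in question: persist the data once, keep beta as a small in-memory array, and pull back only the reduced gradient each iteration. The function below is illustrative, with a fixed step size; it is not the actual dask_glm gradient_descent.)

# Illustrative only -- not the dask_glm implementation.
import numpy as np
from dask import persist
from dask_glm.utils import sigmoid

def toy_gradient_descent(X, y, max_iter=100, step=1.0):
    X, y = persist(X, y)                    # keep the chunked data in cluster memory
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X.dot(beta))            # lazy dask expression
        grad = X.T.dot(p - y)               # reduces to a length-M array
        beta = beta - step * grad.compute() # only the small gradient comes back
    return beta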

mrocklin commented 7 years ago

I think so


mrocklin commented 7 years ago

Note that @eriknw is working on a dask optimization that may help to reduce overhead here: https://github.com/dask/dask/pull/1979

mrocklin commented 7 years ago

I sat down with @amueller and we compared with sklearn's SGD. We found that proximal_grad and sklearn's SGDClassifier were similar in terms of runtime on a single machine (using dask.distributed; we didn't try the threaded scheduler). Presumably SGD was being a bit smarter, while dask-glm was using more hardware.

cicdw commented 7 years ago

@mrocklin Did you look at ADMM? I'm starting to think that, going forward, we should only employ ADMM, Newton, and gradient_descent.

mrocklin commented 7 years ago

Nope, we only spent a few minutes on it. We ran the following:

Prep

import dask.array as da
import numpy as np
from dask import persist, compute
from dask_glm.logistic import *
from dask_glm.utils import *

from distributed import Client
c = Client('localhost:8786')

N = int(1e7)
chunks = int(1e6)
seed = 20009

X = da.random.random((N, 2), chunks=chunks)
y = make_y(X, beta=np.array([-1.5, 3]), chunks=chunks)

X, y = persist(X, y)

Dask GLM

%time proximal_grad(X, y)

SKLearn

from sklearn.linear_model import SGDClassifier
nX, ny = compute(X, y)
%time sgd = SGDClassifier(loss='log', n_iter=10, verbose=10, fit_intercept=False).fit(nX, ny)
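(Continuing the session above, a hypothetical sanity check we could add next time: compare both solvers' coefficients with the true beta = [-1.5, 3]. This assumes proximal_grad returns the fitted coefficient vector.)

# Hypothetical follow-up, not part of the run above.
beta_dask = proximal_grad(X, y)   # assumes the solver returns the fitted beta
print(np.asarray(beta_dask))      # dask-glm estimate
print(sgd.coef_)                  # sklearn SGD estimate (no intercept was fit)
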
cicdw commented 7 years ago

I haven't looked into whether we could use this data for benchmarking, but the incredibly large dataset over at https://www.kaggle.com/c/outbrain-click-prediction/data seems like it could be a good candidate. We might have to process the data a little bit before fitting a model, but I wouldn't mind taking a stab at that piece.
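As a very rough starting point for that preprocessing, the click log could be pulled into dask arrays along these lines (file and column names are assumptions from the Kaggle data page, and meaningful features would need joins against the other tables):

# Rough sketch only: load the Outbrain click log and produce dense dask arrays.
# Column names (display_id, ad_id, clicked) are assumed from the Kaggle page;
# real features would come from joining promoted_content, events, etc.
import dask.dataframe as dd
from dask import persist

clicks = dd.read_csv('clicks_train.csv')
X = clicks[['display_id', 'ad_id']].to_dask_array(lengths=True).astype('float64')
y = clicks['clicked'].to_dask_array(lengths=True)

X, y = persist(X, y)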

cc: @hussainsultan @jcrist