dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Add Lasso #101

Open jakirkham opened 6 years ago

jakirkham commented 6 years ago

Would be good to add Lasso.

TomAugspurger commented 6 years ago

Note: I think that all the pieces should be in place thanks to dask-glm. This should be a matter of translating the scikit-learn API to a linear regression with dask-glm's L1 regularizer.
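
For instance, a rough sketch of such a wrapper (purely illustrative: it assumes dask_glm.estimators.LinearRegression accepts regularizer and lamduh keyword arguments, with lamduh playing the role of scikit-learn's alpha; a Lasso class like this does not exist in dask-ml yet):

from dask_glm.estimators import LinearRegression
from dask_glm.regularizers import L1

class Lasso(LinearRegression):
    # Hypothetical scikit-learn-style wrapper: pin the regularizer to L1 and
    # map scikit-learn's alpha onto dask-glm's lamduh; everything else is
    # delegated to dask-glm.
    def __init__(self, alpha=1.0, **kwargs):
        super().__init__(regularizer=L1(), lamduh=alpha, **kwargs)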

jakirkham commented 6 years ago

Do you have any code snippets that I should look at for trying to do something like this?

TomAugspurger commented 6 years ago

I think that

from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)

lr = LinearRegression(regularizer=L1())
lr.fit(X, y)

is basically correct. I haven't looked at the various options for scikit-learn's Lasso.
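
For reference, these are the main options on scikit-learn's Lasso that a wrapper would eventually need to map onto dask-glm (parameter names taken from scikit-learn; which of them dask-glm can support is an open question):

from sklearn.linear_model import Lasso

r = Lasso(
    alpha=1.0,           # L1 strength (the docs relate it to 1 / (2C); see below)
    fit_intercept=True,
    max_iter=1000,
    tol=1e-4,
    positive=False,      # optionally constrain coefficients to be non-negative
    selection="cyclic",  # coordinate-descent update order ("cyclic" or "random")
)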

jakirkham commented 6 years ago

Hmm...so when scikit-learn implements these sorts of things, they seem to support a vector or matrix for y. However it seems that dask-glm only supports a vector for y. Do you know why that is? Would it be possible to change it? If so, how difficult would that be?

Edit: Have migrated this concern to issue ( https://github.com/dask/dask-ml/issues/201 ).

TomAugspurger commented 6 years ago

It should certainly be possible, but I'm not sure offhand how much work it'll be.

mrocklin commented 6 years ago

Just checking that by matrix, do you mean ndarray with two dimensions, or do you mean an np.matrix object?

If the former then is it an array that could be squeezed, or is there something more complex here with multiple labels?

jakirkham commented 6 years ago

Meaning a 2-D ndarray (though it is a fair question). I should add that scikit-learn typically coerces 1-D ndarrays into singleton 2-D ndarrays when 2-D ndarrays are allowed.

Not sure whether squeezing makes sense. More likely, iterating over the 1-D slices and fitting them independently would make sense, which appears to be what scikit-learn is doing. So this should benefit quite nicely from Distributed.
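
A rough sketch of that per-column approach (illustrative only; it assumes y is a 2-D dask array and reuses the dask-glm estimator from the snippet above):

from dask_glm.estimators import LinearRegression
from dask_glm.regularizers import L1

def fit_columns(X, y2d):
    # Fit one L1-regularized linear regression per output column, which is
    # roughly what scikit-learn does for multi-output targets.
    models = []
    for j in range(y2d.shape[1]):
        lr = LinearRegression(regularizer=L1())
        lr.fit(X, y2d[:, j])
        models.append(lr)
    return models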

valkmit commented 1 year ago

+1, interested in this as well. The provided code

from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)

lr = LinearRegression(regularizer=L1())
lr.fit(X, y)

is missing the ability to set the alpha value; the coefficients suggest this is not a proper lasso regression.

The following example I quickly threw together also doesn't appear to work properly, but it piggybacks on top of Dask GLM's ElasticNet the same way scikit's Lasso runs on top of scikit's ElasticNet.

import dask_glm.algorithms
import dask_glm.families
import dask_glm.regularizers

# X and y come from the make_regression call above
family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)
b = dask_glm.algorithms.gradient_descent(
    X=X, y=y, max_iter=100000, family=family,
    regularizer=regularizer, alpha=0.01, normalize=False,
)

stsievert commented 1 year ago

Isn't it possible to set the regularization value with the code below?

import numpy as np

from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty="l1", C=1e-6)
lr.fit(X, y)
assert np.abs(lr.coef_).max() < 1e-3, "C=1e-6 should produce mostly 0 coefs"

C and alpha/lamduh control the strength of the regularization (but might be inverses of each other).
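
A quick way to sanity-check that relationship (illustrative; it only uses the dask-ml API from the snippet above, and assumes coef_ comes back as a NumPy array):

import numpy as np
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
for C in [1e-6, 1e-3, 1.0, 1e3]:
    lr = LinearRegression(penalty="l1", C=C)
    lr.fit(X, y)
    # smaller C -> stronger penalty -> more coefficients pushed toward zero
    print(C, (np.abs(lr.coef_) < 1e-3).sum())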

valkmit commented 1 year ago

Isn't it possible to set the regularization value with the code below?

import numpy as np

from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty="l1", C=1e-6)
lr.fit(X, y)
assert np.abs(lr.coef_).max() < 1e-3, "C=1e-6 should produce mostly 0 coefs"

C and alpha/lamduh control the strength of the regularization (but might be inverses of each other).

Indeed this is what I was missing, appreciate the pointer!

valkmit commented 1 year ago

Given a small C (implying a large alpha), the regression does appear to behave similarly to Lasso. However, you're also right that C is the inverse of the alpha parameter.

scikit-learn's documentation says alpha = 1 / (2C), where C is the parameter used by its other linear models. So an alpha of 0.01 should correspond to a C of 50.

However, I compared the outputs of scikit-learn's Lasso and Dask's "lasso" with the following code:

from sklearn.linear_model import Lasso
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty='l1', C=50, fit_intercept=False)
lr.fit(X, y)

r = Lasso(alpha=0.01, fit_intercept=False)
r.fit(X.compute(), y.compute())

print(lr.coef_)
print(r.coef_)

The coefficients from the Dask model fit appear unstable. For very small C, they do look the same as scikit-learn's.

I'm no ML expert (in fact I'm just slapping some code together), but it seems like there's definitely an inverse relationship, just not one that's 1 / (2C). That would be fine, except that the performance of dask-ml at very small C is several times worse than scikit-learn's (about 30x worse) for values of C and alpha that empirically appear to give very similar coefficients.

Is there something else I am missing here? Or is this performance slowdown to be expected?

stsievert commented 1 year ago

except that the performance of dask-ml at very small C is several times worse than scikit-learn's (about 30x worse) for values of C and alpha that empirically appear to give very similar coefficients.

What do you mean "30× worse"? I'm not sure I'd expect Dask-ML to provide any kind of timing acceleration with a small array.

C and alpha that empirically appear to give very similar coefficients.

I've verified that C and alpha give very similar coefficients. The two sets of coefficients are very close in relative error, a standard measure in optimization:

# continues from the script above (lr and r are already fit)
import numpy as np
import numpy.linalg as LA

rel_error = LA.norm(lr.coef_ - r.coef_) / LA.norm(r.coef_)
print(rel_error)  # 0.00172; very small: the two vectors are close in Euclidean distance

print(np.abs(r.coef_).max())  # 89.2532; the scikit-learn coefs are large
print(np.abs(lr.coef_ - r.coef_).mean())  # 0.01543; the mean error is small
print(np.abs(lr.coef_ - r.coef_).max())  # 0.10180; the max error is still pretty large
print(np.median(np.abs(lr.coef_ - r.coef_))) # 0.01077; not as expected (1e-3 or 1e-4 expected), fair given debugging
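
If "30× worse" refers to wall-clock time, here is a minimal way to measure it from the script above (illustrative only; it reuses X, y, LinearRegression, and Lasso from that script):

import time

t0 = time.perf_counter()
LinearRegression(penalty='l1', C=50, fit_intercept=False).fit(X, y)
t1 = time.perf_counter()
Lasso(alpha=0.01, fit_intercept=False).fit(X.compute(), y.compute())
t2 = time.perf_counter()
print(f"dask-ml: {t1 - t0:.2f}s  scikit-learn: {t2 - t1:.2f}s")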