Open jakirkham opened 6 years ago
Note: I think that all the pieces should be in place thanks to dask-glm. This should be a matter of translating the scikit-learn API to a linear regression with dask-glm's L1 regularizer.
Do you have any code snippets that I should look at for trying to do something like this?
I think that
from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression
X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(regularizer=L1())
lr.fit(X, y)
is basically correct. I haven't looked at the various options for scikit-learn's Lasso.
Hmm... so when scikit-learn implements these sorts of things, they seem to support a vector or a matrix for y. However, it seems that dask-glm only supports a vector for y. Do you know why that is? Would it be possible to change it? If so, how difficult would that be?
Edit: Have migrated this concern to issue ( https://github.com/dask/dask-ml/issues/201 ).
It should certainly be possible, but I'm not sure offhand how much work it'll be.
Just checking that by matrix, do you mean ndarray with two dimensions, or do you mean an np.matrix object?
If the former then is it an array that could be squeezed, or is there something more complex here with multiple labels?
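To clarify what "squeezable" means here, a rough numpy illustration (nothing dask-specific, just shapes):

```python
import numpy as np

# A 2-D y with a single column can be squeezed down to a vector...
y_single = np.arange(6.0).reshape(-1, 1)   # shape (6, 1)
assert np.squeeze(y_single).shape == (6,)

# ...but a true multi-output y cannot be reduced to one vector.
y_multi = np.arange(12.0).reshape(6, 2)    # shape (6, 2)
assert np.squeeze(y_multi).shape == (6, 2)
```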
Meaning a 2-D ndarray (though it is a fair question). Should add that scikit-learn typically coerces 1-D ndarrays into singleton 2-D ndarrays when 2-D ndarrays are allowed.
Not sure whether squeezing makes sense. More likely iterating over the 1-D slices and fitting them independently would make sense, which appears to be what scikit-learn is doing. So this should benefit quite nicely from Distributed.
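A rough sketch of that iterate-over-columns idea (the helper and the plain least-squares fitter are hypothetical stand-ins for dask-glm's solver, just to show the shape of the loop):

```python
import numpy as np

def fit_multioutput(fit_1d, X, Y):
    """Fit a 1-D-only estimator independently on each column of a 2-D Y.

    `fit_1d` is any function mapping (X, y_vector) -> coefficient vector.
    The per-column fits are independent, which is why a scheduler like
    Distributed could run them in parallel.
    """
    return np.stack([fit_1d(X, Y[:, j]) for j in range(Y.shape[1])])

# Illustration: ordinary least squares standing in for dask-glm's solver.
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
X = np.random.default_rng(0).normal(size=(100, 3))
B = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]])  # true (3, 2) coefficients
coefs = fit_multioutput(ols, X, X @ B)
assert coefs.shape == (2, 3)  # one coefficient vector per output column
```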
+1, interested in this as well. The provided code
from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression
X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(regularizer=L1())
lr.fit(X, y)
does not expose a way to set the alpha value, and the coefficients it produces seem to indicate that this is not a proper lasso regression.
The following example I quickly threw together also doesn't appear to work properly, but it piggybacks on dask-glm's ElasticNet the same way scikit-learn's Lasso builds on scikit-learn's ElasticNet.
import dask_glm.algorithms
import dask_glm.families
import dask_glm.regularizers

family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)
b = dask_glm.algorithms.gradient_descent(X=X, y=y, max_iter=100000, family=family, regularizer=regularizer, alpha=0.01, normalize=False)
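For reference, an elastic net with the L1 mixing ratio set to 1 reduces to the lasso, which is presumably what `weight=1` is meant to do in dask-glm. A quick check of that equivalence using scikit-learn alone (shown purely for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data with a sparse true coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.1 * rng.normal(size=200)

# ElasticNet with l1_ratio=1 is pure L1 regularization, i.e. the Lasso.
enet = ElasticNet(alpha=0.01, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.01).fit(X, y)
assert np.allclose(enet.coef_, lasso.coef_, atol=1e-4)
```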
Isn't it possible to set the regularization value with the code below?
import numpy as np
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty="l1", C=1e-6)
lr.fit(X, y)
assert np.abs(lr.coef_).max() < 1e-3, "C=1e-6 should produce mostly 0 coefs"
C and alpha/lamduh control the strength of the regularization (but they might be inverses of each other).
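To illustrate that inverse direction with a C-parameterized scikit-learn model (logistic regression here, purely as an example): smaller C means stronger regularization and smaller coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) > 0).astype(int)

# In C-parameterized models, *smaller* C = *stronger* regularization,
# the opposite direction from alpha-style parameters.
weak = LogisticRegression(C=100.0).fit(X, y)
strong = LogisticRegression(C=0.001).fit(X, y)
assert np.abs(strong.coef_).sum() < np.abs(weak.coef_).sum()
```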
Indeed this is what I was missing, appreciate the pointer!
Given a small C (implying a large alpha), the regression does appear to function similarly to Lasso. However, you're also right that C is the inverse of the alpha parameter.
Scikit-learn's documentation says alpha = 1/(2C), where C is the parameter used by other linear-regression libraries. So an alpha of 0.01 should correspond to a C of 50.
However, comparing the outputs of scikit-learn's Lasso and Dask's "lasso" with the following code:
from sklearn.linear_model import Lasso
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression
X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(penalty='l1', C=50, fit_intercept=False)
lr.fit(X, y)
r = Lasso(alpha=0.01, fit_intercept=False)
r.fit(X.compute(), y.compute())
print(lr.coef_)
print(r.coef_)
The coefficients for the dask model fit appear unstable. For very small C, they do look the same.
I'm no ML expert - in fact I'm just slapping some code together - but it seems like there's definitely an inverse relationship, just not one that's alpha = 1/(2C). Which would be fine, except that the performance of dask-ml at very small C is several times worse than scikit-learn - about 30x worse - for values of C and alpha that empirically appear to give very similar coefficients.
Is there something else I am missing here? Or is this performance slowdown to be expected?
except the performance of dask ml at very small C is several times worse than scikit - about 30x worse, for values of C and alpha that empirically appear to give very similar coefficients.
What do you mean by "30× worse"? I'm not sure I'd expect Dask-ML to provide any kind of timing acceleration on a small array.
C and alpha that empirically appear to give very similar coefficients.
I've verified that C and alpha give very similar coefficients: the two sets of coefficients are close in relative error, a standard measure in optimization:
# continues from the script above
import numpy as np
import numpy.linalg as LA

rel_error = LA.norm(lr.coef_ - r.coef_) / LA.norm(r.coef_)
print(rel_error)  # 0.00172; very small -- the two vectors are close in Euclidean distance
print(np.abs(r.coef_).max())  # 89.2532; the scikit-learn coefs are large
print(np.abs(lr.coef_ - r.coef_).mean())  # 0.01543; the mean error is small
print(np.abs(lr.coef_ - r.coef_).max())  # 0.10180; the max error is still pretty large
print(np.median(np.abs(lr.coef_ - r.coef_)))  # 0.01077; larger than expected (1e-3 or 1e-4), but fair for debugging
Would be good to add Lasso.