abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/

Input matrix X containing a constant column for `LinearRegression` is more complicated than that in `scikit-learn` #486

Closed belzheng closed 1 year ago

belzheng commented 1 year ago

When the input matrix X contains a constant column, the `LinearRegression()` class in the abess package predicts `nan` instead of estimated values, unlike scikit-learn's `LassoCV()`, which handles the same matrix without a problem. One workaround is to set the parameter `is_normal=False`, but that is neither what users expect nor how scikit-learn behaves. Since I have run into this many times, I wonder if it is possible to optimize this API. The following code describes the case concisely:


Mamba413 commented 1 year ago

@belzheng, would you please paste your code here? thx!

belzheng commented 1 year ago

> @belzheng, would you please paste your code here? thx!

Here is the code:

```python
import numpy
from pyearth import Earth
from sklearn.model_selection import train_test_split
from abess import LinearRegression

numpy.random.seed(0)
m = 1000
n = 10
X = 80 * numpy.random.uniform(size=(m, n)) - 40
y = numpy.abs(X[:, 6] - 4.0) + 1 * numpy.random.normal(size=m)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# The Earth basis expansion includes a constant (intercept) column
model = Earth(enable_pruning=False)
model.fit(X_train, y_train)
X_test_new = model.transform(X_test)
X_train_new = model.transform(X_train)
print(X_train_new)

# abess with default normalization: predictions are all nan
rega = LinearRegression()
rega.fit(X_train_new, y_train)
ya_pred = rega.predict(X_test_new)
print(ya_pred)

# workaround: disable normalization
rega = LinearRegression()
rega.fit(X_train_new, y_train, is_normal=False)
ya_pred = rega.predict(X_test_new)
print(ya_pred)

# scikit-learn's LassoCV handles the same matrix without nan
from sklearn.linear_model import LassoCV
reglasso = LassoCV()
reglasso.fit(X_train_new, y_train)
ylasso_pred = reglasso.predict(X_test_new)
print(ylasso_pred)
```
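The `nan` predictions in the first abess fit above plausibly come from standardization dividing by a zero standard deviation for the constant column. A minimal numpy sketch of that mechanism (an assumption about the cause, not abess internals):

```python
import numpy as np

# Standardizing (x - mean) / std divides by a zero standard deviation
# for a constant column, so 0/0 produces nan. (Assumed mechanism only.)
X = np.hstack([np.random.rand(5, 2), np.ones((5, 1))])  # last column is constant
mean, std = X.mean(axis=0), X.std(axis=0)
with np.errstate(invalid="ignore"):  # suppress the 0/0 warning
    X_scaled = (X - mean) / std
print(np.isnan(X_scaled[:, -1]).all())  # True: the constant column becomes nan
```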
oooo26 commented 1 year ago

I think the main difference is that `LassoCV` does not normalize the data at all. Users can normalize with `sklearn.preprocessing` and drop constant columns in advance.
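The preprocessing route mentioned above can be sketched with scikit-learn's `VarianceThreshold` and `StandardScaler` (a hypothetical workaround, not part of abess):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])  # constant column

# Drop zero-variance (constant) columns, then standardize the rest
prep = make_pipeline(VarianceThreshold(threshold=0.0), StandardScaler())
X_clean = prep.fit_transform(X)
print(X_clean.shape)  # (100, 3): the constant column was removed
```

The cleaned matrix could then be passed to abess's `LinearRegression` (possibly with `is_normal=False`), avoiding the division by a zero standard deviation.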

But the `nan` output is surely annoying... Maybe we should disable normalization when there is a constant column and emit a warning? (Or, failing that, simply disable normalization by default?)

Mamba413 commented 1 year ago

@oooo26, yup... but if we want to warn users, we have to check for constant columns in advance.
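That pre-fit check could be a single vectorized comparison. A hypothetical sketch (the helper name and warning text are invented here, not abess code):

```python
import warnings
import numpy as np

def check_constant_columns(X):
    """Warn if X contains constant columns (hypothetical helper,
    not abess's actual implementation)."""
    X = np.asarray(X)
    # a column is constant iff every row equals the first row in it
    constant = np.all(X == X[0, :], axis=0)
    if constant.any():
        warnings.warn(
            "X contains constant column(s) at indices "
            f"{np.flatnonzero(constant).tolist()}; "
            "normalization would divide by a zero standard deviation."
        )
    return constant

X = np.array([[1.0, 2.0, 5.0],
              [3.0, 2.0, 6.0],
              [4.0, 2.0, 7.0]])
constant = check_constant_columns(X)
print(constant.tolist())  # [False, True, False]
```

The comparison is a single O(n·p) pass over the matrix, so it should be cheap relative to the fit itself.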

Mamba413 commented 1 year ago

But if it is not too time-consuming, I think it is OK.

Mamba413 commented 1 year ago

Does `is_normal` speed up abess?

oooo26 commented 1 year ago

> Does `is_normal` speed up abess?

I have tested on linear/logistic models and there seems to be no obvious difference in speed. Besides, the main algorithm is the same whether we normalize or not.

If we want to check for constant columns, I think `pd.nunique` can help (on the Python side).
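A sketch of that pandas approach (illustrative only): a column is constant exactly when `nunique()` reports one distinct value.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [5.0, 5.0, 5.0],   # constant column
    "c": [0.1, 0.2, 0.3],
})
# nunique() counts distinct values per column; 1 means constant
constant_cols = df.columns[df.nunique() == 1].tolist()
print(constant_cols)  # ['b']
```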

Mamba413 commented 1 year ago

So, why do we have `is_normal` in our API at all? Was it designed by @Jiang-Kangkang?

oooo26 commented 1 year ago

Yes, I think so. And actually, scikit-learn provided a `normalize` parameter at first, but it was deprecated in version 1.0 (and removed in 1.2).