cerlymarco / linear-tree

A python library to build Model Trees with Linear Models at the leaves.
MIT License

LinearForestRegressor may give biased coefficients for base estimator #38

Closed: Yimsun97 closed this issue 6 months ago

Yimsun97 commented 6 months ago

Hi There!

I am very interested in the linear-tree package and I found it inspiring for my research. However, when I used LinearForestRegressor in my study, I found that its base estimator gave biased coefficients (with absolute values that are too small), so the prediction was essentially fitted by the forest estimator alone. As a result, the linear forest behaves very much like a plain random forest regressor. I suspect this may be due to round-off error in the source code function self._validate_data, where the dtype "float32" is used.
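As a quick illustration of the precision loss I mean (a standalone snippet, separate from the reproduction below): with features on the order of 3e7, float32 cannot even resolve sub-unit differences in the feature values, which can distort the fitted coefficients.

import numpy as np

x64 = np.float64(3e7) + 0.5   # a feature value on the scale of X2 below
x32 = np.float32(x64)         # the same value after casting to float32
print(x64 - np.float64(x32))  # 0.5 -- the fractional part is lost entirely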

I generated a synthetic dataset to compare scikit-learn's LinearRegression model with the LinearForestRegressor (see the code below). By the way, how should we deal with data whose features span multiple orders of magnitude? Will the base_estimator parameter support an sklearn pipeline, so that preprocessing such as StandardScaler can be included, in a future release?

Thank you for your excellent work!

import numpy as np
from lineartree import LinearForestRegressor
from sklearn.linear_model import LinearRegression

SEED = 1234
np.random.seed(SEED)  # seed NumPy so the synthetic data is reproducible

# Generate a synthetic dataset with features at very different scales
X1 = np.random.randn(1000, 1) * 1 + 10
X2 = np.random.randn(1000, 1) * 1e7 + 3e7
X3 = np.random.randn(1000, 1) * 100 + 200
X4 = np.random.randn(1000, 1) + 500
X5 = np.random.randn(1000, 1) + 1000
X6 = np.random.randn(1000, 1)
X7 = np.random.randn(1000, 1)
X8 = np.random.rand(1000, 1)

X = np.concatenate([X1, X2, X3, X4, X5, X6, X7, X8], axis=1)
y = X1 + np.sin(X2 * X6) + (X3 / 1e6) ** 2 + X4 / 1e3 + X2 / 1e7 + \
    X7 * X8 + np.random.randn(1000, 1) * 0.1
y = np.log(y)

# Fit a linear regression model
lr = LinearRegression()
lr.fit(X, y)
lr_coef = lr.coef_
print(lr_coef) 

# On my run this gave [[ 7.49327164e-02  7.59350553e-09 -5.17630150e-06 -1.67616079e-05
#  -1.73796325e-03  3.13294480e-04  4.07092831e-02 -7.15923013e-03]]

# Fit a linear forest model
lf = LinearForestRegressor(base_estimator=LinearRegression(),
                           n_estimators=100, max_depth=5,
                           max_features=1.0, random_state=SEED)
lf.fit(X, y)
lf_coef = lf.coef_
print(lf_coef)

# On my run this gave [ 1.3074668e-09  7.2390938e-09 -2.1693744e-05  9.1071959e-09
# -6.6003052e-09 -7.7589535e-09  7.1229582e-09  5.3837756e-09]
cerlymarco commented 6 months ago

Hi, did you try using LinearForestRegressor inside a pipeline with a StandardScaler at the top, like:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), LinearForestRegressor(...))
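For instance, on your synthetic data it would look roughly like this (a sketch; the hyperparameters are just illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from lineartree import LinearForestRegressor

# Scale the features first so the base estimator sees comparable magnitudes
model = make_pipeline(
    StandardScaler(),
    LinearForestRegressor(base_estimator=LinearRegression(),
                          n_estimators=100, max_depth=5)
)
model.fit(X, y.ravel())

# Coefficients of the base estimator, now in the scaled feature space
# (make_pipeline names the step after the lowercased class name)
print(model.named_steps['linearforestregressor'].coef_)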
Yimsun97 commented 6 months ago

Hi, did you try using LinearForestRegressor inside a pipeline with a StandardScaler at the top, like:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), LinearForestRegressor(...))

Thank you for your reply! I have tried the pipeline and it worked!

I've seen the example on the repository homepage showing that a linear forest can be used to resolve the extrapolation issues of a random forest. After using the pipeline, I found that the R-squared of the linear forest on the test set (~0.65) is lower than that of a random forest (0.70). Is this commonly seen in regression problems? How can I improve the fit of the linear forest, or does it mean there is a trade-off between goodness of fit and extrapolation ability?
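Roughly, the comparison looked like this (a simplified sketch with a random train/test split; the exact hyperparameters I used may differ):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from lineartree import LinearForestRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X, y.ravel(), test_size=0.3, random_state=SEED)

# Linear forest inside the scaling pipeline suggested above
lf = make_pipeline(StandardScaler(),
                   LinearForestRegressor(base_estimator=LinearRegression(),
                                         n_estimators=100, max_depth=5,
                                         random_state=SEED))
lf.fit(X_train, y_train)
print(lf.score(X_test, y_test))   # R-squared of the linear forest on the test set

# Plain random forest with comparable settings
rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=SEED)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # R-squared of the random forest on the test set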

Thank you!

cerlymarco commented 6 months ago

Finding the trade-off between predictive and extrapolation ability is one of the hardest tasks in the ML ecosystem. Some models are good at maximizing accuracy, others at extracting explanatory insights. There is no silver bullet for this kind of problem. You should make the proper choice according to your data and needs. All the best