cerlymarco / linear-tree

A Python library to build Model Trees with Linear Models at the leaves.
MIT License

Allow the hyperparameter "max_depth = 0". #23

Closed jckkvs closed 2 years ago

jckkvs commented 2 years ago

Thanks for the good library.

When using LinearTreeRegressor, I think that max_depth is often optimized by cross-validation.

This library constrains max_depth to the range 1-20. However, depending on the dataset, a plain linear regression may be the best fit. Even for such a dataset, max_depth is forced to be at least 1, so a simple linear regression cannot be expressed with LinearTreeRegressor.

My suggestion is to have LinearTreeRegressor fall back to fitting base_estimator alone when "max_depth = 0". With this change, LinearTreeRegressor could handle both segmented regression and simple regression just by changing a hyperparameter.
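Until such an option exists upstream, one workaround is a thin dispatch helper. This is a hypothetical sketch, not part of the library: `make_regressor` and its `tree_cls` parameter are names invented here for illustration (in practice `tree_cls` would be `lineartree.LinearTreeRegressor`):

```python
from sklearn.base import clone

def make_regressor(base_estimator, max_depth, tree_cls=None):
    """Hypothetical helper: interpret max_depth == 0 as 'fit the base
    estimator alone, with no tree on top'. For max_depth >= 1, build the
    model tree (tree_cls would be lineartree.LinearTreeRegressor)."""
    if max_depth == 0:
        return clone(base_estimator)  # plain simple regression
    return tree_cls(base_estimator, max_depth=max_depth)
```

This keeps a cross-validation loop over max_depth = 0, 1, 2, ... uniform, since every candidate is built through the same call.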

cerlymarco commented 2 years ago

Hi,

Generally (for both linear and standard decision trees), setting a specific max_depth (let's assume 10) doesn't force the model to grow to depth 10! If there is no benefit in splitting, the algorithm stops before reaching depth 10.
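The same early-stopping behavior can be seen with a plain scikit-learn DecisionTreeRegressor: with a constant target no split reduces the error, so the fitted tree stays at depth 0 even though max_depth=10 would allow it to grow (a minimal sketch, independent of lineartree):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(20).reshape(-1, 1)
y = np.zeros(20)  # constant target: no split can reduce the impurity

tree = DecisionTreeRegressor(max_depth=10).fit(X, y)
print(tree.get_depth())  # 0 -- the tree stopped immediately
```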

A practical example: if we have to predict a perfect line with a LinearTreeRegressor and max_depth=20:

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from lineartree import LinearTreeRegressor

>>> X = np.arange(100).reshape(-1, 1)
>>> y = np.arange(100)

>>> lt = LinearTreeRegressor(LinearRegression(), max_depth=20).fit(X, y)
>>> lt.summary()
{0: {'loss': 0.0, 'models': LinearRegression(), 'samples': 100}}

Only one LinearRegression is fitted.

If you support the project don't forget to leave a star ;-)

jckkvs commented 2 years ago

Thanks. I misunderstood the max_depth behavior because I had tested LinearTreeRegressor's max_depth on synthetic data with inflection points.
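For reference, data with inflection points is exactly the case where extra depth does pay off: one global line cannot fit y = |x - 50|, while two piecewise fits can. A sketch with plain NumPy (no lineartree required), mimicking what a depth-1 model tree would do:

```python
import numpy as np

X = np.arange(100)
y = np.abs(X - 50.0)  # piecewise-linear target with an inflection at x = 50

# One global line: large residual error
c = np.polyfit(X, y, 1)
mse_single = np.mean((np.polyval(c, X) - y) ** 2)

# Two lines split at the inflection point: near-zero error
mse_split = 0.0
for mask in (X < 50, X >= 50):
    c = np.polyfit(X[mask], y[mask], 1)
    mse_split += np.sum((np.polyval(c, X[mask]) - y[mask]) ** 2)
mse_split /= len(X)

print(mse_single, mse_split)  # the split fit is near-exact, the single line is not
```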

jckkvs commented 2 years ago

Hi.

I tried the code you suggested, but the result was different.

import numpy as np
from sklearn.linear_model import LinearRegression
from lineartree import LinearTreeRegressor

X = np.arange(100).reshape(-1,1)
y = np.arange(100)

lt = LinearTreeRegressor(LinearRegression(), max_depth=20).fit(X,y)
lt.summary()
{0: {'col': 0, 'th': 37.5, 'loss': 0.0, 'samples': 100,
     'children': (1, 2),
     'models': (LinearRegression(), LinearRegression())},
 1: {'loss': 0.0, 'samples': 38, 'models': LinearRegression()},
 2: {'loss': 0.0, 'samples': 62, 'models': LinearRegression()}}

Python 3.7.11, scikit-learn 0.24.2, NumPy 1.20.3, lineartree 0.3.3

I've tried a few other examples; if the number of samples is 11 or less, only one LinearRegression is fitted.

cerlymarco commented 2 years ago
[screenshot of notebook output]

Here is the running notebook for reproducibility.

EDIT: This may also be due to the numeric precision of your environment... where a loss of (for example) 5.429976129669105e-29 is not equal to 0.0, so the tree continues to grow. This is automatically handled (by setting a fixed rounding precision) in lineartree>=0.3.4
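A minimal illustration of the precision issue, using the loss value quoted above: an exact-zero comparison fails on a numerically-tiny loss, while a fixed rounding precision (the kind of fix described for lineartree>=0.3.4) makes it succeed:

```python
import numpy as np

loss = 5.429976129669105e-29  # tiny but nonzero leaf loss

print(loss == 0.0)                # False -- an exact-zero check keeps splitting
print(np.round(loss, 6) == 0.0)   # True  -- fixed rounding stops the tree
```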