LAMDA-NJU / Deep-Forest

An Efficient, Scalable and Optimized Python Framework for Deep Forest (2021.2.1)
https://deep-forest.readthedocs.io

[BUG] cannot correctly clone `CascadeForestRegressor` with `sklearn.base.clone` when using customized estimators #92

Open IncubatorShokuhou opened 3 years ago

IncubatorShokuhou commented 3 years ago

Describe the bug: a CascadeForestClassifier/CascadeForestRegressor object cannot be correctly cloned with sklearn.base.clone when customized estimators are used.

To Reproduce

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.base import clone
from deepforest import CascadeForestRegressor
import xgboost as xgb
import lightgbm as lgb

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestRegressor(random_state=1)

# set estimator
n_estimators = 4  # the number of base estimators per cascade layer
estimators = [lgb.LGBMRegressor(random_state=i) for i in range(n_estimators)]
model.set_estimator(estimators)

# set predictor 
predictor = xgb.XGBRegressor()
model.set_predictor(predictor)

# clone the model
model_new = clone(model)

# try to fit the cloned model (raises RuntimeError)
model_new.fit(X_train, y_train)

Expected behavior: no error.

Additional context

~/miniconda3/envs/pycaret/lib/python3.8/site-packages/deep_forest-0.1.5-py3.8-linux-x86_64.egg/deepforest/cascade.py in fit(self, X, y, sample_weight)
   1004                 if not hasattr(self, "predictor_"):
   1005                     msg = "Missing predictor after calling `set_predictor`"
-> 1006                     raise RuntimeError(msg)
   1007 
   1008             binner_ = Binner(

RuntimeError: Missing predictor after calling `set_predictor`

This bug occurs because, when a model with a customized predictor or estimators is cloned, the `predictor='custom'` flag is cloned, while `self.predictor_` / `self.dummy_estimators_` are not, which leaves the clone in the inconsistent state described above.
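To illustrate the mechanism (a minimal toy sketch, not Deep-Forest code): `sklearn.base.clone` rebuilds an estimator from `get_params()`, which only covers constructor arguments, so any attribute set by a separate setter method is silently dropped:

```python
from sklearn.base import BaseEstimator, clone

class Toy(BaseEstimator):
    def __init__(self, predictor="forest"):
        self.predictor = predictor      # constructor arg -> survives clone

    def set_predictor(self, predictor):
        self.predictor = "custom"       # flag is a constructor arg
        self.predictor_ = predictor     # plain attribute -> lost on clone

toy = Toy()
toy.set_predictor(object())
copy = clone(toy)
print(copy.predictor)               # "custom" -- the flag is copied
print(hasattr(copy, "predictor_"))  # False -- the actual predictor is gone
```

The clone therefore claims to have a custom predictor (`predictor='custom'`) while lacking `predictor_`, which is exactly the check that raises the RuntimeError in `cascade.py`.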

I think this bug can be easily fixed by making the predictor and the list of estimators constructor parameters of CascadeForestClassifier/CascadeForestRegressor, the way other meta-estimators (e.g. ngboost) do it, though the corresponding APIs would have to change.

For example, the API parameters could be:

model = CascadeForestRegressor(
    estimators=[lgb.LGBMRegressor(random_state=i) for i in range(n_estimators)],
    predictor=xgb.XGBRegressor(),
)
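A hypothetical sketch of why the constructor-based API would fix the issue (the class name and signature below are illustrative, not the actual Deep-Forest API): once `estimators` and `predictor` are constructor parameters, `clone` copies them via `get_params()` automatically.

```python
from sklearn.base import BaseEstimator, clone

class SketchRegressor(BaseEstimator):
    """Illustrative only: estimators/predictor as constructor params."""
    def __init__(self, estimators=None, predictor=None):
        self.estimators = estimators
        self.predictor = predictor

sketch = SketchRegressor(estimators=["lgbm0", "lgbm1"], predictor="xgb")
copy = clone(sketch)
print(copy.estimators)  # ['lgbm0', 'lgbm1'] -- survives cloning
print(copy.predictor)   # 'xgb'
```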
xuyxu commented 3 years ago

Thanks for reporting, will take a look during the weekend.