christophM / rulefit

Python implementation of the rulefit algorithm
MIT License

Compatibility with GridSearchCV of sklearn #49

Open jckkvs opened 1 year ago

jckkvs commented 1 year ago

To tune hyperparameters with sklearn's GridSearchCV, I think the estimator should define a score method. Without one, GridSearchCV fails when no scoring is given:

from sklearn.model_selection import GridSearchCV
from rulefit import RuleFit
from sklearn.datasets import load_diabetes
model = RuleFit()
X,y = load_diabetes(return_X_y=True)
param_grid = {"tree_size":[4,6]}
gcv = GridSearchCV(model, param_grid=param_grid)
gcv.fit(X,y)
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator RuleFit(max_iter=1000) does not.
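
To illustrate the score requirement without depending on RuleFit itself, here is a minimal sketch. `NoScoreRegressor` and `ScoredRegressor` are hypothetical stand-ins invented for this example, not part of rulefit: the first has fit and predict but no score, and the second gains a default R^2 score by inheriting sklearn's `RegressorMixin`. Whether this matches the change proposed for RuleFit is an assumption.

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV


class NoScoreRegressor(BaseEstimator):
    """Hypothetical stand-in: has fit/predict but no score method."""

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        # Delegate to a plain linear model; only the interface matters here.
        self.model_ = LinearRegression(fit_intercept=self.fit_intercept).fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)


class ScoredRegressor(RegressorMixin, NoScoreRegressor):
    """Same estimator; RegressorMixin contributes score(), which returns R^2."""


X, y = load_diabetes(return_X_y=True)
param_grid = {"fit_intercept": [True, False]}

# Without a score method and without scoring=..., GridSearchCV raises TypeError.
try:
    GridSearchCV(NoScoreRegressor(), param_grid=param_grid).fit(X, y)
    no_score_error = None
except TypeError as exc:
    no_score_error = exc

# With a score method, the same call works with no scoring argument.
gcv_scored = GridSearchCV(ScoredRegressor(), param_grid=param_grid).fit(X, y)
print(type(no_score_error).__name__, gcv_scored.best_params_)
```

R^2 is only a sensible default for regression, so an estimator that also supports classification would presumably need a score that depends on its mode.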

Passing an explicit scoring argument to GridSearchCV avoids this error, so GridSearchCV can already be used:

gcv = GridSearchCV(model, param_grid=param_grid, scoring='neg_mean_squared_error')
gcv.fit(X,y)

However, calling fit again on gcv's best_estimator_ raises an error:

gcv.best_estimator_.fit(X,y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9504/1281572257.py in <module>
----> 1 gcv.best_estimator_.fit(X,y)

~\Anaconda3\envs\\lib\site-packages\rulefit\rulefit.py in fit(self, X, y, feature_names)
    416                     self.tree_generator.set_params(random_state=i_size+random_state_add) # warm_state=True seems to reset random_state, such that the trees are highly correlated, unless we manually change the random_sate here.
    417                     self.tree_generator.get_params()['n_estimators']
--> 418                     self.tree_generator.fit(np.copy(X, order='C'), np.copy(y, order='C'))
    419                     curr_est_=curr_est_+1
    420                 self.tree_generator.set_params(warm_start=False)

~\Anaconda3\envs\env\lib\site-packages\sklearn\ensemble\_gb.py in fit(self, X, y, sample_weight, monitor)
    492                                  'warm_start==True'
    493                                  % (self.n_estimators,
--> 494                                     self.estimators_.shape[0]))
    495             begin_at_stage = self.estimators_.shape[0]
    496             # The requirements of _decision_function (called in two lines

ValueError: n_estimators=1 must be larger or equal to estimators_.shape[0]=552 when warm_start==True
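
The ValueError above comes from sklearn's gradient boosting code, not from RuleFit directly. The sketch below reproduces the same check using only sklearn and synthetic data (made up for the example). It assumes, based on the traceback, that an already-fitted tree generator is being refit with warm_start=True and a smaller n_estimators; the issue does not confirm the root cause inside RuleFit.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic regression problem (illustrative data only).
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X[:, 0] + 0.1 * rng.rand(50)

gbr = GradientBoostingRegressor(n_estimators=5, random_state=0)
gbr.fit(X, y)  # estimators_ now holds 5 fitted stages

# With warm_start=True, sklearn keeps the fitted stages and refuses an
# n_estimators smaller than the number already fitted.
gbr.set_params(n_estimators=1, warm_start=True)
try:
    gbr.fit(X, y)
    warm_start_error = None
except ValueError as exc:
    warm_start_error = exc

# clone() copies the parameters but drops the fitted state,
# so the copy can be fitted from scratch.
fresh = clone(gbr).set_params(warm_start=False)
fresh.fit(X, y)
print(warm_start_error)
print(fresh.estimators_.shape[0])
```

If the assumption holds, one possible workaround on the user side is to refit a clone of the estimator rather than the fitted object, though this is untested against RuleFit here.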

Versions: scikit-learn 0.24 and 1.0, Python 3.7, RuleFit 0.3

jckkvs commented 1 year ago

I created a pull request.