csinva / imodels

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
https://csinva.io/imodels
MIT License

[feature request] need a `verbose: int` param for each model #92

Open tigerinus opened 2 years ago

tigerinus commented 2 years ago

I have a training dataset of around 1.5 million records. I was trying to get FIGSRegressor to fit it, and it has been running for more than 2 hours without any indication of its progress.

It'd be great to have a `verbose: int` param in the constructor that reports what's happening during fitting, with the amount of detail controlled by the int level passed in.

E.g.

ensemble.RandomForestRegressor(n_jobs=-1, random_state=rand_state, verbose=1)
ensemble.BaggingRegressor(n_jobs=-1, random_state=rand_state, verbose=1)
xgb.XGBRegressor(verbosity=1, booster='gbtree', n_jobs=-1, random_state=rand_state)
lgb.LGBMRegressor(num_leaves=2047, random_state=rand_state, force_col_wise=True, verbose=1)
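A rough sketch of how that could look on FIGSRegressor (note that the verbose parameter below is the proposed addition, not an existing argument):

from imodels import FIGSRegressor

# verbose=1 is the *proposed* param - it does not exist in imodels yet.
# The idea is to log progress (e.g. each rule/split added) during fit.
reg = FIGSRegressor(verbose=1)
reg.fit(X_train, y_train)  # X_train/y_train: the large training set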

Thanks.

csinva commented 2 years ago

Thanks for pointing this out - it's a good idea and we'll add it soon!

In the meantime, you can set the max_rules parameter in the FIGSRegressor to some reasonable number (e.g. 12) to make training much faster!
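For example, a minimal sketch of that suggestion (assuming X_train and y_train are already loaded):

from imodels import FIGSRegressor

# Capping the total number of rules bounds how long fitting runs
model = FIGSRegressor(max_rules=12)
model.fit(X_train, y_train)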

tigerinus commented 2 years ago

> Thanks for pointing this out - it's a good idea and we'll add it soon!
>
> In the meantime, you can set the max_rules parameter in the FIGSRegressor to some reasonable number (e.g. 12) to make training much faster!

After setting the max_rules param, training completed, but the predict call that followed ended up with a KeyError:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-da8bf48375c8> in <module>
     31     print(f'{clf_name} training time: {t2-t1} seconds')
     32 
---> 33     y_predicted = clf.predict(X_validate)
     34     score_1 = metrics.mean_squared_error(y_validate, y_predicted)
     35     #score_2 = metrics.mean_squared_log_error(y_validate, y_predicted)

~/usr/lib64/python3.8/site-packages/imodels/tree/figs.py in predict(self, X)
    270         preds = np.zeros(X.shape[0])
    271         for tree in self.trees_:
--> 272             preds += self.predict_tree(tree, X)
    273         if self.prediction_task == 'regression':
    274             return preds

~/usr/lib64/python3.8/site-packages/imodels/tree/figs.py in predict_tree(self, root, X)
    306         preds = np.zeros(X.shape[0])
    307         for i in range(X.shape[0]):
--> 308             preds[i] = predict_tree_single_point(root, X[i])
    309         return preds
    310 

~/usr/lib64/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/usr/lib64/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 0

Since this is the only param I changed, I'm not sure whether it's something I missed or a flaw in the code. Let me know if I should file a bug separately.

csinva commented 2 years ago

Ah, thank you for pointing this out - indeed we will fix it on our end. The issue is that the FIGSRegressor predict function currently expects a numpy array, not a pandas DataFrame. We will change the function to handle both types (for a quick workaround, you can use clf.predict(X_validate.values) for now).
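A minimal sketch of the workaround, plus an illustration of the planned library-side fix (assuming X_validate is a pandas DataFrame; the helper name below is illustrative, not the actual imodels code):

import numpy as np
import pandas as pd

# Workaround: pass the underlying numpy array instead of the DataFrame
y_predicted = clf.predict(X_validate.values)

# Illustrative library-side fix: coerce DataFrame input to an array inside predict
def _coerce_to_array(X):
    return X.values if isinstance(X, pd.DataFrame) else np.asarray(X)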