csinva / imodels

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
https://csinva.io/imodels
MIT License

RuleFitClassifier not working with simple example using iris data #131

Open gialmisi opened 1 year ago

gialmisi commented 1 year ago

The following code snippet results in an error:

from sklearn.datasets import load_iris
from imodels import RuleFitClassifier

iris = load_iris()
X, Y = iris.data, iris.target
rulefit = RuleFitClassifier()
rulefit.fit(X, Y)
print(rulefit)

The error reads:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_208411/3401153452.py in <cell line: 9>()
      7 rulefit = RuleFitClassifier()
      8 rulefit.fit(X, Y)
----> 9 print(rulefit)

~/.cache/pypoetry/virtualenvs/xlemoo-6BFI3yUJ-py3.8/lib/python3.8/site-packages/imodels/rule_set/rule_fit.py in __str__(self)
    247         s += '> \tPredictions are made by summing the coefficients of each rule\n'
    248         s += '> ------------------------------\n'
--> 249         return s + self.visualize().to_string(index=False) + '\n'
    250 
    251     def _extract_rules(self, X, y) -> List[Rule]:

~/.cache/pypoetry/virtualenvs/xlemoo-6BFI3yUJ-py3.8/lib/python3.8/site-packages/imodels/rule_set/rule_fit.py in visualize(self, decimals)
    237 
    238     def visualize(self, decimals=2):
--> 239         rules = self._get_rules()
    240         rules = rules[rules.coef != 0].sort_values("support", ascending=False)
    241         pd.set_option('display.max_colwidth', None)

~/.cache/pypoetry/virtualenvs/xlemoo-6BFI3yUJ-py3.8/lib/python3.8/site-packages/imodels/rule_set/rule_fit.py in _get_rules(self, exclude_zero_coef, subregion)
    208         for i in range(0, n_features):
    209             if self.lin_standardise:
--> 210                 coef = self.coef[i] * self.friedscale.scale_multipliers[i]
    211             else:
    212                 coef = self.coef[i]

IndexError: index 4 is out of bounds for axis 0 with size 4

I tried to look into this issue myself, but I am not familiar enough with the method to make any definitive claims. However, this line of code seems fishy. Why not just use the actual number of features stored in self.n_features? That could be the source of the indexing error.

vruusmann commented 11 months ago

Looks like imodels classifiers only work with binary classification problems.

The iris dataset deals with a multi-class classification problem. The code snippet can be fixed by transforming the label from multi-class to binary:

from sklearn.datasets import load_iris
from imodels import RuleFitClassifier

import numpy

iris = load_iris()
X, y = iris.data, iris.target

# THIS! Predict if the iris species is "virginica" or not
y = numpy.where(y == 2, 1, 0)
#print(y)

rulefit = RuleFitClassifier()
rulefit.fit(X, y)
print(rulefit)

Gabriel-Kissin commented 5 days ago

Thanks @vruusmann. Just spent a while working this out myself, came here to report it, and found this issue. The issue can also be seen if you replace print(rulefit) with rulefit.predict(X) in the original code snippet.

Suggested action - the documentation should be changed to reflect this limitation. Nothing here indicates that multiclass classification won't work (although here, in a table of which tasks are supported by the different models, only binary classification and regression are mentioned).

Better still - raise an explicit error when y is multiclass, explaining that it needs to be binary.
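For illustration, such a check could look roughly like the sketch below. The helper name check_binary_target is hypothetical and this is not the code imodels uses; it only assumes that y is an array-like of class labels and counts the distinct values with NumPy.

```python
import numpy as np


def check_binary_target(y):
    """Hypothetical helper: reject targets with more than two classes.

    Returns the sorted array of distinct labels found in y.
    """
    classes = np.unique(y)
    if classes.size > 2:
        # Fail early with a clear message instead of an IndexError later on.
        raise ValueError(
            "Only binary classification is supported, but y has "
            f"{classes.size} classes: {classes.tolist()}"
        )
    return classes
```

Calling a check like this at the start of fit would turn the confusing IndexError into an immediate, explicit ValueError for the iris labels.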

Best of all - support multiclass classification!
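Until that exists, one possible workaround is a one-vs-rest split, which generalizes the "virginica or not" trick above to every class. The sketch below only builds the binary targets; the function name one_vs_rest_targets is hypothetical, and fitting one binary classifier per target is left to the caller.

```python
import numpy as np


def one_vs_rest_targets(y):
    """Hypothetical helper: map each class label to a 0/1 target vector.

    For label c, the vector is 1 where y == c and 0 everywhere else.
    """
    y = np.asarray(y)
    return {label: (y == label).astype(int) for label in np.unique(y)}


# Example: three classes produce three binary problems.
targets = one_vs_rest_targets([0, 1, 2, 2])
for label, binary_y in targets.items():
    print(label, binary_y.tolist())
```

Each binary target could then be passed to its own RuleFitClassifier, giving one interpretable rule set per class. How best to combine the per-class predictions is an open design choice and is not covered here.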

Gabriel-Kissin commented 5 days ago

Related to https://github.com/csinva/imodels/issues/93 and https://github.com/csinva/imodels/issues/77, so this issue is a duplicate.

csinva commented 5 days ago

Thanks both for the interest in the package and for raising these issues!

Commit 50431c8ec62edd97646fbea968d71e964262761f adds code to raise an error during fitting explaining that multiclass is not supported for RuleFitClassifier. Hopefully we can actually support multiclass classification soon...