Closed mcschmitz closed 1 year ago
Hi @mcschmitz, wow thank you for finding this! This is a pretty major bug and I'm surprised it wasn't found earlier (`GreedyRuleListClassifier` isn't a very popular model, but a bug this big should have been causing problems).
A PR fixing these would be great! Let me know if you run into any trouble!
Do you have some form of automation when creating the demo notebooks? If not, I can also push the updates for the notebooks.
On another note: I started with Python 3.6, as this is mentioned to work on PyPI. However, `TestBRL.test_integration_fitting` and `TestFIGS.test_categorical` were both failing with `AttributeError: 'OneHotEncoder' object has no attribute 'get_feature_names_out'`. The Python badge in the README mentions 3.7, but `setup.py` as well as the PyPI package mention 3.6. I upgraded to 3.7 and all tests pass, so I didn't try to fix it on 3.6. Just wanted to note that.
Thanks for making this!
There isn't currently any automation, but I will update the notebooks manually. You are right, there is an issue with 3.6 right now; I will try to fix it, but we may drop 3.6 support soon...
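Dropping 3.6 would typically be declared via `python_requires`, so that pip on an old interpreter refuses the install up front instead of failing later at test time. A sketch of such a `setup.py` fragment (the package name here is an assumption, everything is illustrative):

```python
# Hypothetical setup.py fragment -- a sketch, not the project's actual file.
from setuptools import setup

setup(
    name="imodels",           # assumed package name
    python_requires=">=3.7",  # drop 3.6, matching the README badge
)
```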
Hi!
I am not 100% sure whether the following issues I encountered are bugs, so please bear with me if I simply misunderstood something.

In the way the `GreedyRuleListClassifier.fit` method is defined, the probability of the left node is assigned as `np.mean(y)` in this line. However, shouldn't it be `np.mean(y_left)`, as the value should define the conditional probability of a positive prediction (= a prediction of class 1) as soon as a sample ends up in this node? From what I understood, a sample can only terminate in the left leaf node if, after the application of all rules, none of the conditions are met. But then it should get assigned the conditional probability of a positive prediction given that it ended up in that leaf, while right now it gets assigned the probability of the second-to-last node.

In the `predict_proba` method, shouldn't the condition be flipped as soon as the rule required flipping the right and left node, which is handled in this and the following lines? The reason I'm asking is the following small example of a dataset that is perfectly separable by applying a single rule:

```python
X = np.array([0.8, 0.8, 0.3, 0.3, 0.3, 0.3]).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1, 1])

model = GreedyRuleListClassifier(max_depth=1)
model.fit(X, y, verbose=True)
y_pred = model.predict(X)
print(confusion_matrix(y, y_pred))
```

This prints a completely inverted confusion matrix:

```
[[0 2]
 [4 0]]
```
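The first point can be illustrated with plain numpy (a sketch, not imodels code): for the dataset above, assuming a hypothetical rule `X > 0.5` that sends samples to the right node, `np.mean(y)` and `np.mean(y_left)` give different left-node probabilities:

```python
import numpy as np

# Sketch only, not imodels code: the dataset from the example above,
# with a hypothetical rule "X > 0.5" sending samples to the right node.
X = np.array([0.8, 0.8, 0.3, 0.3, 0.3, 0.3])
y = np.array([0, 0, 1, 1, 1, 1])

rule_fires = X > 0.5      # right node: rule condition met
y_left = y[~rule_fires]   # left node: no condition met

p_all = np.mean(y)        # P(y=1) over ALL samples at the node: 4/6
p_left = np.mean(y_left)  # P(y=1 | sample falls into the left node): 4/4

print(p_all)   # 0.666... -> not what the left leaf should carry
print(p_left)  # 1.0      -> the conditional probability for that leaf
```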
Let me know if you can confirm this. I can provide a PR fixing both issues in that case.