GreedyRulesListClassifier rules don't describe points in all decision regions in some cases

davidefiocco commented 1 year ago

Hi, I think I have encountered a small issue in the current implementation of GreedyRulesListClassifier (I'm on 1.3.17).

In a nutshell, fitted classifiers rules_ may not contain a description of all sets in which the rules list partition the training set, due to the fact that a max depth is reached.

The issue is a bit subtle/minor, but here's a description: when running something along the lines of

import pandas as pd
import numpy as np
import imodels

y = np.random.randint(2, size = 1000)
x1 = y + np.random.rand(1000)*2
x2 = y + np.random.rand(1000)*2
dataset = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

clf = imodels.GreedyRuleListClassifier(max_depth = 3, criterion = 'gini')
clf.fit(dataset[["x1", "x2"]], dataset["y"], feature_names=["x1", "x2"])
print(clf.rules_)

the rules in output, depending on the realization of the (randomly seeded) dataset may be like:

[{'col': 'x1',
  'index_col': 0,
  'cutoff': 1.9932525157928467,
  'val': 0.2917771883289125,
  'flip': False,
  'val_right': 1.0,
  'num_pts': 1000,
  'num_pts_right': 246},
 {'col': 'x2',
  'index_col': 1,
  'cutoff': 1.9933324456214905,
  'val': 0.18473282442748093,
  'flip': False,
  'val_right': 1.0,
  'num_pts': 754,
  'num_pts_right': 99},
 {'col': 'x2',
  'index_col': 1,
  'cutoff': 1.0350375175476074,
  'val': 0.0037313432835820895,
  'flip': False,
  'val_right': 0.31007751937984496,
  'num_pts': 655,
  'num_pts_right': 387}]

What's not OK in those rules is that no info is given to describe the fate of the outstanding 1000 - (246 + 99 + 387) = 268 points not involved in the rule set (e.g. those not counted by the various 'num_pts_right').

A better behavior is observed instead for other random instances of the dataset e.g. with the rules set

[{'col': 'x2',
  'index_col': 1,
  'cutoff': 1.0142560601234436,
  'val': 0.010380622837370242,
  'flip': False,
  'val_right': 0.6877637130801688,
  'num_pts': 1000,
  'num_pts_right': 711},
 {'col': 'x2',
  'index_col': 1,
  'cutoff': 0.9986478090286255,
  'val': 0.0,
  'flip': False,
  'val_right': 0.3333333333333333,
  'num_pts': 289,
  'num_pts_right': 9},
 {'val': 0.0, 'num_pts': 280}]

In which the classification rules involve explicitly all 711 + 9 + 280 = 1000 points (with the outstanding ones included in num_pts in the last rule).

The issue may be hard to notice, but can be seen by wrapping in a for loop the code above, and printing the sum of the points dealt with by the rules, by adding the snippet (for debugging):

# print the count of points dealt with by the rules
count_pts = 0
for r in clf.rules_:
    if "num_pts_right" in r.keys():
        count_pts += r["num_pts_right"]
    else:
        count_pts += r["num_pts"]
print(f"Counted points: {count_pts}")

When doing so, running the code above repeatedly with different random instances of dataset one should see the issue popping up occasionally, getting in stdout something like

Counted points: 1000
Counted points: 1000
Counted points: 724 <- glitch
Counted points: 1000
Counted points: 743 <- glitch
Counted points: 1000
Counted points: 1000
Counted points: 720 <- glitch
Counted points: 1000
...

One possible fix (needs review @csinva !) would be the replacement of the current

https://github.com/csinva/imodels/blob/1243240fec3aae33852ba680ba6aea66a4f86ca7/imodels/rule_list/greedy_rule_list.py#L61-L63

with

        # base case 3: max depth reached
        elif depth == self.max_depth:
            return [{'val': np.mean(y), 'num_pts': y.size}]

thus providing a count and stats for points when max_depth is reached. This should allow rules to partition explicitly all points in the training set.

This needs checking, but I believe that may work, and could possibly be addressed together with https://github.com/csinva/imodels/issues/169. Thanks a lot!

csinva commented 1 year ago

Thanks for taking the time to document this issue and propose a fix! Very appreciated :)

I've just looked through, and this fix looks good to me! I'll change it now.

csinva commented 1 year ago

Fixed in cee96eff8d254f712f7d858fe5be8aeecd138911

csinva / imodels

GreedyRulesListClassifier rules don't describe points in all decision regions in some cases #170