csinva / imodels

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
https://csinva.io/imodels
MIT License

BayesianRuleListClassifier Type Error with Categorical Features #33

Closed ryanallen82 closed 3 years ago

ryanallen82 commented 3 years ago

I'm getting the following error when I try to use a string variable in my dataset:


TypeError                                 Traceback (most recent call last)
in <module>
----> 1 brl.fit(X_train, y_train, undiscretized_features=['agag_id'])

~/opt/anaconda3/lib/python3.8/site-packages/imodels/rule_list/bayesian_rule_list/bayesian_rule_list.py in fit(self, X, y, feature_labels, undiscretized_features, verbose)
    119     raise Exception("Only binary classification is supported at this time!")
    120
--> 121 itemsets, self.discretizer = extract_fpgrowth(X, y,
    122     feature_labels=feature_labels,
    123     minsupport=self.minsupport,

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/extract.py in extract_fpgrowth(X, y, feature_labels, minsupport, maxcardinality, undiscretized_features, verbose)
     31
     32 discretizer = BRLDiscretizer(X, y, feature_labels=feature_labels, verbose=verbose)
---> 33 X = discretizer.discretize_mixed_data(X, y, undiscretized_features)
     34 X_df_onehot = discretizer.onehot_df
     35

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in discretize_mixed_data(self, X, y, undiscretized_features)
    286     "Warning: non-categorical data found. Trying to discretize. (Please convert categorical values to "
    287     "strings, and/or specify the argument 'undiscretized_features', to avoid this.)")
--> 288 X = self.discretize(X, y)
    289
    290 self.discretized_X = X

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in discretize(self, X, y)
    297 print("Discretizing ", self.discretized_features, "...")
    298 D = pd.DataFrame(np.hstack((X, np.array(y).reshape((len(y), 1)))), columns=list(self.feature_labels) + ["y"])
--> 299 self.discretizer = MDLP_Discretizer(dataset=D, class_label="y", features=self.discretized_features)
    300
    301 cat_data = pd.DataFrame(np.zeros_like(X))

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in __init__(self, dataset, class_label, out_path_data, out_path_bins, features)
     59 self._cuts = {f: [] for f in self._features}
     60 # get cuts for all features
---> 61 self.all_features_accepted_cutpoints()
     62 # discretize self._data
     63 self.apply_cutpoints(out_data_path=out_path_data, out_bins_path=out_path_bins)

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in all_features_accepted_cutpoints(self)
    218 '''
    219 for attr in self._features:
--> 220     self.single_feature_accepted_cutpoints(feature=attr)
    221 return
    222

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in single_feature_accepted_cutpoints(self, feature, partition_index)
    190     return
    191 # determine whether to cut and where
--> 192 cut_candidate = self.best_cut_point(data=data_partition, feature=feature)
    193 if cut_candidate == None:
    194     return

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in best_cut_point(self, data, feature)
    160 :return: value of cut point with highest information gain (if many, picks first). None if no candidates
    161 '''
--> 162 candidates = self.boundaries_in_partition(data=data, feature=feature)
    163 # candidates = self.feature_boundary_points(data=data, feature=feature)
    164 if not candidates:

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in boundaries_in_partition(self, data, feature)
    151 '''
    152 range_min, range_max = (data[feature].min(), data[feature].max())
--> 153 return set([x for x in self._boundaries[feature] if (x > range_min) and (x < range_max)])
    154
    155 def best_cut_point(self, data, feature):

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in <listcomp>(.0)
    151 '''
    152 range_min, range_max = (data[feature].min(), data[feature].max())
--> 153 return set([x for x in self._boundaries[feature] if (x > range_min) and (x < range_max)])
    154
    155 def best_cut_point(self, data, feature):

TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'

csinva commented 3 years ago

Hello, thanks for your interest in the package! We have added some code that deals with string features for BRL. However, much like when using scikit-learn models, best practice is to first encode your string features as binary variables before fitting; our models assume that this is the case.
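
For anyone landing here with the same error, one common way to do that encoding is pandas' `get_dummies`. The snippet below is only a minimal sketch: `agag_id` is the column name from the report above, while the `age` column and all of the values are invented for illustration. The final `fit` call is left as a comment because it assumes a `brl` model and `y_train` labels like those in the original post.

```python
import pandas as pd

# Hypothetical toy data: 'agag_id' is the string column from the report;
# the 'age' column and every value here are made up for illustration.
X_train = pd.DataFrame({
    "agag_id": ["a", "b", "a", "c"],
    "age": [31, 45, 27, 52],
})

# One-hot encode the string column into 0/1 indicator columns.
# dtype=int requests integer 0/1 values rather than booleans.
X_encoded = pd.get_dummies(X_train, columns=["agag_id"], dtype=int)

print(X_encoded)

# The encoded frame contains only numeric columns, so (assuming a model
# object `brl` and labels `y_train` as in the original post) it could then
# be passed on, e.g.:
# brl.fit(X_encoded, y_train)
```

Each distinct string value becomes its own binary column, so no string values remain in the feature matrix when it reaches the discretizer.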