Ekeany / Boruta-Shap

A tree-based feature selection tool that combines the Boruta feature selection algorithm with Shapley values.
MIT License

Add more model evaluation metrics #99

Open · Marktus opened 2 years ago

Marktus commented 2 years ago

Description

BorutaShap currently evaluates feature importance using either SHAP values or "gini". It gives the user no control over how shap is run internally, e.g. no way to set feature_perturbation="interventional", model_output="probability", etc.
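For illustration, the kind of passthrough being requested could look like the sketch below. The `explainer_kwargs` and `shap_values_kwargs` arguments are hypothetical; they do not exist in the current BorutaShap API.

```python
from sklearn.ensemble import RandomForestClassifier
from BorutaShap import BorutaShap

model = RandomForestClassifier()

# Hypothetical API sketch -- these two keyword arguments do NOT exist in
# BorutaShap today; they show how shap options could be forwarded internally.
Feature_Selector = BorutaShap(
    model=model,
    importance_measure='shap',
    classification=True,
    explainer_kwargs={"feature_perturbation": "interventional",  # -> shap.TreeExplainer(...)
                      "model_output": "probability"},
    shap_values_kwargs={"check_additivity": False},              # -> explainer.shap_values(...)
)
```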

Also, a machine learning algorithm that does not expose a "gini"-style importance (i.e. a feature_importances_ attribute) cannot be used with BorutaShap's "gini" option at all. This is another limitation that could be lifted.
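For estimators without feature_importances_, a model-agnostic measure such as scikit-learn's permutation importance could be offered as an extra importance_measure. Below is a minimal sketch of the underlying computation; permutation_importance is a real scikit-learn API, but wiring it into BorutaShap is the hypothetical part.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# KNeighborsClassifier exposes no feature_importances_, so today it cannot
# be used with importance_measure='gini'.
model = KNeighborsClassifier().fit(X, y)

# Permutation importance works for any fitted estimator with a score method.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # one importance value per feature
```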

Reasoning

I have run into situations where the model errored when using "gini" with a random forest. In other instances, the error came from shap itself and could have been avoided if the relevant shap options were controllable by the user.

```python
from sklearn.ensemble import RandomForestClassifier
from BorutaShap import BorutaShap

def boruta_shap_simulation(model, X, y):
    Feature_Selector = BorutaShap(model=model, importance_measure='gini', classification=True)
    result = Feature_Selector.fit(X=X, y=y, n_trials=1, random_state=0,
                                  train_or_test="train", normalize=True, verbose=True)
    return result

model = RandomForestClassifier()
bsa_results = boruta_shap_simulation(model, X_train_sampled, y_train_sampled)
```

```
AttributeError                            Traceback (most recent call last)
in <module>()
     14     return result
     15 model = RandomForestClassifier()
---> 16 bsa_results = boruta_shap_simulation(model, X_train_sampled, y_train_sampled, num_processors=20)

in boruta_shap_simulation(model, X, y, num_processors)
      5     #pool = mp.Pool(num_processors)
      6     start = time.time()
----> 7     Feature_Selector = BorutaShap(model=model, importance_measure='gini', classification=True)
      8     #result = pool.imap(Feature_Selector.fit(X=X, y=y, n_trials=1, random_state=0, train_or_test="train", normalize=True, verbose=True))
      9     result = Feature_Selector.fit(X=X, y=y, n_trials=1, random_state=0, train_or_test="train", normalize=True, verbose=True)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a9cc5b77-4046-405e-a6bd-a84c65420c14/lib/python3.9/site-packages/BorutaShap.py in __init__(self, model, importance_measure, classification, percentile, pvalue)
     61         self.classification = classification
     62         self.model = model
---> 63         self.check_model()
     64
     65

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a9cc5b77-4046-405e-a6bd-a84c65420c14/lib/python3.9/site-packages/BorutaShap.py in check_model(self)
    103
    104         elif check_feature_importance is False and self.importance_measure == 'gini':
--> 105             raise AttributeError('Model must contain the feature_importances_ method to use Gini try Shap instead')
    106
    107         else:

AttributeError: Model must contain the feature_importances_ method to use Gini try Shap instead
```

Switching to `importance_measure='shap'` instead fails later, inside shap's additivity check:

```python
def boruta_shap_simulation(model, X, y):
    Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=True)
    result = Feature_Selector.fit(X=X, y=y, n_trials=1, random_state=0,
                                  train_or_test="train", normalize=True, verbose=True)
    return result

model = RandomForestClassifier()
bsa_results = boruta_shap_simulation(model, X_train_sampled, y_train_sampled)
```

```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
in <module>()
     15
     16 model = RandomForestClassifier()
---> 17 bsa_results = boruta_shap_simulation(model, X_train_sampled, y_train_sampled, num_processors=20)

in boruta_shap_simulation(model, X, y, num_processors)
      7     Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=True)
      8     #result = pool.imap(Feature_Selector.fit(X=X, y=y, n_trials=1, random_state=0, train_or_test="train", normalize=True, verbose=True))
----> 9     result = Feature_Selector.fit(X=X, y=y, n_trials=1, random_state=0, train_or_test="train", normalize=True, verbose=True)
     10     #pool.close()
     11     #pool.join()

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a9cc5b77-4046-405e-a6bd-a84c65420c14/lib/python3.9/site-packages/BorutaShap.py in fit(self, X, y, n_trials, random_state, sample, train_or_test, normalize, verbose, stratify)
    361         self.Check_if_chose_train_or_test_and_train_model()
    362
--> 363         self.X_feature_import, self.Shadow_feature_import = self.feature_importance(normalize=normalize)
    364         self.update_importance_history()
    365         hits = self.calculate_hits()

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a9cc5b77-4046-405e-a6bd-a84c65420c14/lib/python3.9/site-packages/BorutaShap.py in feature_importance(self, normalize)
    606         if self.importance_measure == 'shap':
    607
--> 608             self.explain()
    609             vals = self.shap_values
    610

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a9cc5b77-4046-405e-a6bd-a84c65420c14/lib/python3.9/site-packages/BorutaShap.py in explain(self)
    732         if self.classification:
    733             # for some reason shap returns values wraped in a list of length 1
--> 734             self.shap_values = np.array(explainer.shap_values(self.X_boruta))
    735         if isinstance(self.shap_values, list):
    736

/databricks/python/lib/python3.9/site-packages/shap/explainers/_tree.py in shap_values(self, X, y, tree_limit, approximate, check_additivity, from_call)
    406         out = self._get_shap_output(phi, flat_output)
    407         if check_additivity and self.model.model_output == "raw":
--> 408             self.assert_additivity(out, self.model.predict(X))
    409
    410         return out

/databricks/python/lib/python3.9/site-packages/shap/explainers/_tree.py in assert_additivity(self, phi, model_output)
    537         if type(phi) is list:
    538             for i in range(len(phi)):
--> 539                 check_sum(self.expected_value[i] + phi[i].sum(-1), model_output[:,i])
    540         else:
    541             check_sum(self.expected_value + phi.sum(-1), model_output)

/databricks/python/lib/python3.9/site-packages/shap/explainers/_tree.py in check_sum(sum_val, model_output)
    533             " was %f, while the model output was %f. If this difference is acceptable" \
    534             " you can set check_additivity=False to disable this check." % (sum_val[ind], model_output[ind])
--> 535             raise Exception(err_msg)
    536
    537

Exception: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. Consider retrying with the feature_perturbation='interventional' option. This check failed because for one of the samples the sum of the SHAP values was 34.761224, while the model output was 0.070000. If this difference is acceptable you can set check_additivity=False to disable this check.
```

Implementation
==============

Overview of possible implementations

Tasks
=====

Concrete tasks to be completed, in the order they need to be done. Include links to the specific lines of code where each task should happen.

- [ ] Task 1
- [ ] Task 2
- [ ] Task 3
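As a starting point for the implementation, note that both remedies suggested by the shap error message above (feature_perturbation='interventional' and check_additivity=False) are reachable today by calling shap directly on the trained model. The sketch below uses synthetic data and only real shap/scikit-learn APIs; it is the call that an explainer-kwargs passthrough would let BorutaShap make internally.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# The interventional mode needs background data; it is the option the shap
# error message itself recommends trying.
explainer = shap.TreeExplainer(model, data=X, feature_perturbation="interventional")

# check_additivity=False disables the failing additivity assertion.
shap_values = explainer.shap_values(X, check_additivity=False)
print(np.shape(shap_values))
```

If BorutaShap accepted and forwarded these options (as sketched under Description), both failures reported here could be addressed without patching the library.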