Ekeany / Boruta-Shap

A Tree based feature selection tool which combines both the Boruta feature selection algorithm with shapley values.
MIT License

[BUG] Misplaced important features #73

Open 564C52 opened 2 years ago

564C52 commented 2 years ago

Hello and thank you for this wonderful tool.

I work with a regression Boruta setup as follows:

Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False, train_or_test='test', normalize=True, verbose=True)
Feature_Selector.plot(which_features='all')

The ranking of the main features works pretty well and makes sense given a cross-validated RF model I ran earlier. However, a handful of these features have a 'funny' rank.

For instance, some 'confirmed important' features are located below the max shadow feature, and symmetrically, in another dataset, some 'confirmed unimportant' features are located above it.

Here are two Feature_Selector.plot outputs showing the issue. I hope you can help me with this.

[screenshots: ex_1, ex_2]

Best wishes, and I'm sorry if I've left out anything that would help in understanding my problem.

Ekeany commented 2 years ago

Yes, I have seen this before.

This happens because we are not comparing the average importance values at the end of the program.

Instead, the acceptance or rejection decision is made after each run, treating the runs as Bernoulli trials.

So the features that were rejected (although they have a higher average importance value) had a feature importance below the max shadow feature a statistically significant number of times, which is enough to reject them.
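A minimal sketch of that decision rule (not the package's actual code; the `binomtest` call, the 0.5 null, and the array shapes are assumptions for illustration):

```python
import numpy as np
from scipy.stats import binomtest

def boruta_decisions(real_imps, shadow_imps, alpha=0.05):
    """Accept/reject features by counting, per trial, whether each feature
    beat the max shadow importance, then applying a binomial test.

    real_imps:   (n_trials, n_features) importances of the real features
    shadow_imps: (n_trials, n_shadow)   importances of the shuffled shadows
    """
    max_shadow = shadow_imps.max(axis=1)             # max shadow importance per trial
    hits = (real_imps > max_shadow[:, None]).sum(0)  # trials where each feature won
    n_trials = real_imps.shape[0]

    decisions = []
    for h in hits:
        # Null hypothesis: a feature beats the max shadow by chance (p = 0.5).
        p_accept = binomtest(int(h), n_trials, 0.5, alternative='greater').pvalue
        p_reject = binomtest(int(h), n_trials, 0.5, alternative='less').pvalue
        if p_accept < alpha:
            decisions.append('accepted')
        elif p_reject < alpha:
            decisions.append('rejected')
        else:
            decisions.append('tentative')
    return decisions
```

Note that the mean importance never enters this rule, only the per-trial hit counts, which is why the ranking by mean can disagree with the accepted/rejected labels.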

jhmenke commented 2 years ago

Just to clarify: The accepted/rejected value is more important than the mean feature importance?

In my case, some features with negative importance are accepted while others with positive importance are rejected.

Ekeany commented 2 years ago

No, we treat it as a Bernoulli distribution.

So if a feature has had an importance greater than the random (shadow) features a significant number of times, then it is accepted; if not, then it is rejected.

So a feature can be accepted even though its mean feature importance is less than the mean importance of the random features.

Hope that helps.
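As a toy illustration of how the mean and the hit count can disagree (hypothetical numbers, reusing the logic of the sketch above): a feature that barely beats the max shadow in nearly every trial is accepted, while one with rare huge wins but frequent losses is rejected, even though the latter has the larger mean.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_trials = 100
max_shadow = np.full(n_trials, 1.0)   # pretend the max shadow importance is 1.0 each trial

# Feature A: barely beats the shadow almost every trial -> small mean, many hits.
feat_a = 1.05 + 0.01 * rng.standard_normal(n_trials)
# Feature B: rare huge wins, frequent losses -> large mean, few hits.
feat_b = np.where(rng.random(n_trials) < 0.2, 10.0, 0.5)

for name, imps in [("A", feat_a), ("B", feat_b)]:
    hits = int((imps > max_shadow).sum())
    p = binomtest(hits, n_trials, 0.5, alternative='greater').pvalue
    print(f"{name}: mean={imps.mean():.2f} hits={hits}/{n_trials} p={p:.3g}")
# A: mean ~1.05, ~100 hits, tiny p -> accepted
# B: mean ~2.40, ~20 hits,  p ~1   -> rejected
```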


jhmenke commented 2 years ago

Yes, thank you.

caballerown commented 2 years ago

FYI - The comments under fit() correspond to the steps in the standard Boruta algorithm set forth by Kursa & Rudnicki (i.e., via the MZSA test). As discussed in this thread, that is not how BorutaSHAP determines whether to keep or discard a feature. So... I found myself a little confused when inspecting the code until I found this thread.