feature-engine / feature_engine

Feature engineering package with sklearn-like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

PSI feature selection for categorical features #655

Closed: dlaprins closed this issue 1 year ago

dlaprins commented 1 year ago

Currently, PSI feature selection as implemented in selection/drop_psi_features.py can only be applied to numerical features: it first discretises them into buckets, then calculates PSI over the resulting bins.
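
For context, a minimal sketch of the kind of computation involved (this is not feature-engine's actual code; the equal-width bucketing and the epsilon guard for empty bins are simplifying assumptions):

```python
import numpy as np

def psi(expected_frac, actual_frac, eps=1e-6):
    """PSI between two binned distributions, given as arrays of bin fractions."""
    e = np.clip(expected_frac, eps, None)  # eps guards log(0) for empty bins
    a = np.clip(actual_frac, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

def psi_numerical(basis, test, n_bins=10):
    """Bucket a numerical feature with edges fit on the basis split, then compute PSI."""
    edges = np.histogram_bin_edges(basis, bins=n_bins)
    e = np.histogram(basis, bins=edges)[0] / len(basis)
    # Out-of-range test values are simply ignored in this sketch.
    a = np.histogram(test, bins=edges)[0] / len(test)
    return psi(e, a)
```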

It would be useful to also be able to use DropHighPSIFeatures on categorical features. I believe the implementation is fairly straightforward.
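
For the categorical case the discretisation step drops out: one would align category frequencies across the two splits and apply the same formula. A hedged sketch of the idea, reusing the `psi` helper above (not the actual proposed code):

```python
import pandas as pd

def psi_categorical(basis: pd.Series, test: pd.Series) -> float:
    """Align category frequencies of both splits on a shared index, then apply psi()."""
    e = basis.value_counts(normalize=True)
    all_cats = e.index.union(test.unique())  # categories seen in either split
    e = e.reindex(all_cats, fill_value=0.0)
    a = test.value_counts(normalize=True).reindex(all_cats, fill_value=0.0)
    return psi(e.to_numpy(), a.to_numpy())
```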

I'd be happy to pick up this issue if it is of interest to others.

solegalli commented 1 year ago

Hi @dlaprins

Do you normally use the PSI for categorical variables?

In principle, it makes sense to me.

@gverbock introduced the class to Feature-engine, so it would be interesting to hear from you as well.

@gverbock does this sound OK to you too? As in, do you use the PSI for categorical variables?

gverbock commented 1 year ago

Intuitively, I would use the chi-square test for categorical values. Using PSI could be nice because it allows computing sample similarity for numerical and categorical variables at once. What I wonder is whether PSI is suited for this. If PSI gives completely different results from a chi-square test, it may be better not to offer this option. @dlaprins Are you aware of literature comparing the two approaches?
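
For anyone who wants to compare the two empirically, a quick sketch that runs both statistics on the same synthetic categorical split (the data are made up, and `psi_categorical` refers to the sketch above):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
basis = pd.Series(rng.choice(["a", "b", "c"], size=5000, p=[0.50, 0.30, 0.20]))
test = pd.Series(rng.choice(["a", "b", "c"], size=5000, p=[0.40, 0.35, 0.25]))

# Chi-square test of homogeneity on the contingency table of counts.
table = pd.crosstab(
    pd.concat([basis, test], ignore_index=True),
    ["basis"] * len(basis) + ["test"] * len(test),
)
chi2_stat, p_value, _, _ = chi2_contingency(table)

print(chi2_stat, p_value)
print(psi_categorical(basis, test))  # PSI on the normalised category frequencies
```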

dlaprins commented 1 year ago

Hi @solegalli, @gverbock. PSI is defined as a sum over bins, which are by definition discrete (i.e., categorical). There are two issues affecting PSI that are present for numerical features but not problematic for (nominal) categorical ones:

Therefore, I would suggest that it is perhaps even more natural to use PSI for categoricals than for numericals.

As for chi-square tests and the accompanying literature, there is a reference in the code (added by @glevv I believe? I could be mistaken) to Yurdakul's PhD thesis (2018, see https://scholarworks.wmich.edu/cgi/viewcontent.cgi?article=4249&context=dissertations). Theorems 2.2.3 and 2.4.1 essentially show that, up to higher-order corrections in sample size, PSI is chi-square distributed under the null hypothesis that the two populations are identically distributed.
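
Concretely, the result says that under the null, PSI is approximately distributed as (1/N + 1/M) times a chi-square variable with B - 1 degrees of freedom, where N and M are the two sample sizes and B is the number of bins. A minimal sketch of the critical value this implies, assuming that form (the quantile level and sample sizes are illustrative, not what the library uses):

```python
from scipy.stats import chi2

N, M, B = 1000, 1000, 10  # sample sizes of the two splits and number of bins
q = 0.999                 # quantile of the null distribution

# Critical PSI value under the chi-square approximation:
# flag a feature as unstable if its PSI exceeds this.
threshold = chi2.ppf(q, B - 1) * (1.0 / N + 1.0 / M)
print(threshold)  # roughly 0.056 for these settings
```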

gverbock commented 1 year ago

@dlaprins It sounds like extending to categorical variables would add value.

ClaudioSalvatoreArcidiacono commented 1 year ago

Hey All, I would like to contribute to this issue, as I already have a working implementation of this on a private fork.

dlaprins commented 1 year ago

Hi @ClaudioSalvatoreArcidiacono, I actually just finished my implementation. I opened a PR; I'd be very happy if you could critique it (or, if it's beyond salvaging, feel free to push your own code ;) ).

ClaudioSalvatoreArcidiacono commented 1 year ago

Hey @dlaprins, I just noticed your message. I will take a look at your PR, and we can keep the one that is more complete :)

ClaudioSalvatoreArcidiacono commented 1 year ago

Hey @dlaprins, nice PR. I see that we can cherry-pick some things from both branches to get to the best final result. I actually like your implementation in the source code better, but I think my tests are better at testing the added functionality.

I saw that in your PR you also changed the way the auto threshold is computed. This is great, but in my opinion it is out of scope for this issue. I would suggest opening a separate issue for that. What do you think?

dlaprins commented 1 year ago

Thanks @ClaudioSalvatoreArcidiacono, fully agree that your unit tests are better.

I am not sure that the auto threshold is out of scope, though: the auto threshold depends on the number of bins/categories. The numerical implementation sets the same number of categories for all features by means of the bucketer, but categorical variables generically have differing numbers of categories. As an example, consider a dataset with a numerical feature, a categorical with 3 categories, and a categorical with 2 categories: the auto threshold cannot be computed correctly for all three features without making the function take an extra argument for the number of bins. Would you agree?
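
To make that concrete, under the chi-square approximation above, the auto threshold comes out differently for each of the three features in that example, because the degrees of freedom depend on the number of bins or categories (a sketch; split sizes and quantile are illustrative):

```python
from scipy.stats import chi2

N = M = 1000  # illustrative split sizes
q = 0.999     # illustrative quantile

# Bins/categories per feature: a numerical feature bucketed into 10 bins,
# a categorical with 3 categories, and a categorical with 2 categories.
for name, n_bins in [("numerical", 10), ("cat_three", 3), ("cat_two", 2)]:
    print(name, chi2.ppf(q, n_bins - 1) * (1.0 / N + 1.0 / M))
```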

Shall I add your tests to my PR?

ClaudioSalvatoreArcidiacono commented 1 year ago

Now I understand, thanks for explaining. It makes sense then.

Go ahead and copy the unit tests from my branch to your branch.

Beware that your current implementation does not support categorical features encoded as strings (rather than as the category dtype), so some of the unit tests will fail because of that.
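
As a stopgap, casting string-typed columns to the pandas category dtype before fitting would sidestep those failures (a hedged workaround on a hypothetical frame, not something the current PR does):

```python
import pandas as pd

# Hypothetical frame with a string-typed categorical column.
df = pd.DataFrame({"colour": ["red", "blue", "red"], "x": [0.1, 0.2, 0.3]})

# Cast object (string) columns to the category dtype before fitting.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype("category")
```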

Moreover, some unit tests are currently failing on the main branch as well, due to the release of pandas 2.0 (see also #659). I am working on a fix for that.

I can review your PR once it is ready if you want.

solegalli commented 1 year ago

Lots of discussion going on in this issue! Thank you so much for the engagement, guys!

I was on holiday and am slowly catching up. I'll leave a comment somewhere if needed. Thanks a lot!