feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

SmartCorrelatedSelection is not removing all correlated features #641

Closed gverbock closed 1 year ago

gverbock commented 1 year ago

While using SmartCorrelatedSelection in a project, I found some unexpected behavior. I have the following correlation matrix:

[image: correlation matrix of the input features]

If I use the following:

[image: SmartCorrelatedSelection usage]

then the resulting correlation matrix still contains correlations above the threshold:

[image: correlation matrix after selection]

f_5 and f_6 are still correlated above the 0.8 threshold. I had a quick look, and I believe the issue is related to how _examined_features is defined. If a feature is evaluated when all the features it correlates with are already in _examined_features, it is treated as a non-correlated feature, which is not necessarily the case.
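
A minimal sketch of the suspected mechanism, in plain Python. This is not the library's code: the function name `greedy_groups`, the `examined` set, the correlation values, and the "keep the last member of each group" rule are all assumptions made for illustration. It only shows how a feature that is skipped because it is already examined can end up retained next to a feature it correlates with.

```python
# Hypothetical pairwise correlations, chosen to mirror the situation in this issue.
corr = {
    frozenset(("f_2", "f_5")): 0.60,
    frozenset(("f_2", "f_6")): 0.90,
    frozenset(("f_5", "f_6")): 0.85,
}
features = ["f_2", "f_5", "f_6"]
threshold = 0.8


def greedy_groups(features, corr, threshold):
    """Group each unexamined feature with the unexamined features it correlates with."""
    examined = set()
    groups = []
    for feature in features:
        if feature in examined:
            continue
        examined.add(feature)
        group = [feature]
        for other in features:
            if other in examined:
                continue  # already assigned to a group, so never compared again
            if abs(corr[frozenset((feature, other))]) > threshold:
                group.append(other)
                examined.add(other)
        groups.append(group)
    return groups


groups = greedy_groups(features, corr, threshold)
# groups == [["f_2", "f_6"], ["f_5"]]

# Assumed selection rule (a stand-in for any real criterion): keep the last member.
kept = [group[-1] for group in groups]
# kept == ["f_6", "f_5"]

# Both retained features are still correlated above the threshold.
residual = corr[frozenset(kept)]
# residual == 0.85
```

In this sketch f_5 ends up alone in its group only because f_6 was already examined when f_5 was evaluated, so the pair (f_5, f_6) is never compared.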

gverbock commented 1 year ago

I see this behaviour on a dataset used for work, so I cannot share the full example here. I will try to reproduce it on synthetic data to better understand exactly what is going on.
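
One possible way to build such a synthetic dataset, as a sketch using NumPy and pandas. The construction is an assumption, not the original data: f_2 and f_5 are each made as f_6 plus independent noise, so both correlate strongly with f_6 but less strongly with each other. With a noise scale of 0.6, the population correlations are 1/sqrt(1.36) ≈ 0.857 for each pair involving f_6 and 1/1.36 ≈ 0.735 for the pair (f_2, f_5).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
n = 5000

# f_6 is the shared signal; f_2 and f_5 add independent noise to it.
f_6 = rng.normal(size=n)
f_2 = f_6 + 0.6 * rng.normal(size=n)
f_5 = f_6 + 0.6 * rng.normal(size=n)

X = pd.DataFrame({"f_2": f_2, "f_5": f_5, "f_6": f_6})
corr = X.corr()

threshold = 0.8
# Boolean matrix: which pairs exceed the threshold (diagonal is trivially True).
above = corr.abs() > threshold
print(corr.round(3))
print(above)
```

With this setup, (f_2, f_6) and (f_5, f_6) should exceed the 0.8 threshold while (f_2, f_5) should not, which is the chain-like pattern described in this issue.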

solegalli commented 1 year ago

Hey @gverbock

I think you have already described what's going on here. f_6 is grouped with f_2; then f_2 is removed from the data and f_6 remains. However, f_6 is also correlated with f_5, and f_5 was not correlated with f_2 above the threshold, so f_5 was never grouped with them.
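
A small helper to check for this situation after selection, sketched with pandas. The function name `pairs_above_threshold`, the toy data, and the column f_9 are hypothetical; the idea is simply to list any pairs of retained features whose absolute correlation still exceeds the threshold.

```python
import pandas as pd


def pairs_above_threshold(X: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, |correlation|) for pairs above the threshold."""
    corr = X.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs


# Toy data standing in for the features left after selection (values are made up).
X_selected = pd.DataFrame(
    {
        "f_5": [1.0, 2.0, 3.0, 4.0, 5.0],
        "f_6": [1.1, 1.9, 3.2, 3.9, 5.1],
        "f_9": [2.0, -1.0, 0.5, 3.0, -2.0],
    }
)

pairs = pairs_above_threshold(X_selected, threshold=0.8)
for a, b, value in pairs:
    print(f"{a} and {b} are still correlated: {value:.3f}")
```

Here only the pair (f_5, f_6) is reported, since f_9 is only weakly correlated with the other two columns.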

We've been discussing this for a while; see #327.

I am not sure how to resolve this problem, to be honest. We have PR #633, which allows us to order the features. This will ensure reproducibility, but it does not address this particular issue.

gverbock commented 1 year ago

That's correct. I'll close this issue then and will think about it further.