add function to sort variables in correlation selection transformers

feature-engine / feature_engine

Feature engineering package with sklearn like functionality

https://feature-engine.trainindata.com/

BSD 3-Clause "New" or "Revised" License

1.8k stars 303 forks source link

add function to sort variables in correlation selection transformers #648

Closed solegalli closed 4 months ago

solegalli commented 1 year ago

Hey @glevv

I reorgaized the function in its own file, so we do not repeat code. Same with the docstrings. And I re-worded a bit the user guide.

For some reason, Github does not find your repo and does not allow me to make a PR to your branch. hence, I am opening a new PR here.

We'd need a test to corroborate that the sorting function is working within the transformer.

And also, the type tests are failing, Could you please have a look at these 2 things?

Thanks a lot!

solegalli commented 1 year ago

Sorted the mypy error.

However, when adding a test for column ordering, I found some odd behaviour when selecting features that I do not understand. This requires further investigation.

In addition, we could enhance smart selection transforming by adding a second round of correlation check after the first one in case there are correlated features remaining. See #641

solegalli commented 10 months ago

Trying to revive this PR @glevv @ dlaprins

TODO:

[ ] add tests to corroborate that the drop correlated transformer and the smart correlated are working as expected when sorting variables.
[ ] polish the user guide

Would you guys by any chance have time to help?

I am having a look at @dlaprins implementation in #619

I kind of like it better than the current one, because it seems faster. But it changes a lot of the intermediate attributes, so it is a bigger change. I wonder if we could blend that implementation without changing the learned attributes, to make this a smaller change, and then maybe in a subsequent PR we can change the attributes?

codecov[bot] commented 10 months ago

Codecov Report

Merging #648 (5a8de4a) into main (bd4c8fc) will increase coverage by 0.11%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #648      +/-   ##
==========================================
+ Coverage   97.99%   98.11%   +0.11%     
==========================================
  Files         100      101       +1     
  Lines        3849     3875      +26     
  Branches      752      759       +7     
==========================================
+ Hits         3772     3802      +30     
+ Misses         28       26       -2     
+ Partials       49       47       -2

Files Changed	Coverage Δ
...re_engine/_docstrings/init_parameters/selection.py	`100.00% <100.00%> (ø)`
feature_engine/selection/_helper_functions.py	`100.00% <100.00%> (ø)`
...ature_engine/selection/drop_correlated_features.py	`94.73% <100.00%> (+4.73%)`	:arrow_up:
...re_engine/selection/smart_correlation_selection.py	`100.00% <100.00%> (+2.35%)`	:arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

dlaprins commented 10 months ago

Thank you for the kind words @solegalli .

I'm a little bit lost on what exactly should be done. To my mind, after adding the option to specify the order of features in the base correlation selection makes smart correlation selection obsolete. In addition, the removal of the old attribute correlated_featuresets is desirable, because this attribute has no meaningful statistical interpretation due to lack of transitivity of collinearity.

The biggest issue I see is in backwards compatibility, specifically, which trade-off is worthwhile and which isn't. I personally like my selection algorithm (bottom-up rather than top-down) slightly better, but is it really worth the speed increase to break backwards compatibility like that? The new algorithm is guaranteed to give quite different results. As far as I'm concerned, whether this trade-off is worth it or not is up to the taste of the maintainers of the package. I think the decisions on which way to go with respect to 1) keeping smart selection around, 2) making use of the new algorithm, and 3) replacing correlated_featuresets with the more meaningful, but different, dictionary in my code should be done first before updating the user guide.

For what it's worth, I recall spending some time on updating the unit test for my algorithm and concluding that it functions as expected. But it's been a while, I recall @solegalli finding results that were not in line with expectations, and it could also well be that others find it insufficient.

glevv commented 9 months ago

Trying to revive this PR @glevv @ dlaprins

TODO:
* [ ]  add tests to corroborate that the drop correlated transformer and the smart correlated are working as expected when sorting variables.

* [ ]  polish the user guide
Would you guys by any chance have time to help?

I am having a look at @dlaprins implementation in #619

I kind of like it better than the current one, because it seems faster. But it changes a lot of the intermediate attributes, so it is a bigger change. I wonder if we could blend that implementation without changing the learned attributes, to make this a smaller change, and then maybe in a subsequent PR we can change the attributes?

It uses std to order variables and as we discussed earlier (in #633) it's not a good strategy in general

dlaprins commented 9 months ago

Certainly. It can be replaced with coefficient of variation, or IQR or MAD for more robust alternatives. Any of these as a default would be fine, I think the option to have an ordered list as user input is the most important point