sort variables to ensure consistent results between runs in correlation transformers

feature-engine / feature_engine

Feature engineering package with sklearn like functionality

https://feature-engine.trainindata.com/

BSD 3-Clause "New" or "Revised" License

1.8k stars 304 forks source link

sort variables to ensure consistent results between runs in correlation transformers #703

Closed pangjac closed 4 months ago

pangjac commented 8 months ago

This PR aims to fix issues reported #570 and #684, where the SmartCorrelatedSelection attribute features_to_drop_ may not return the same result when run multiple times.

The issue is caused by the fact that the sort_values() method does not guarantee a consistent order between columns if they share the same value.

The fix is to ensure consistent ordering by sorting by the value first, and then by the feature name alphabetically for equal values.

Pytest has passed by pytest tests/test_selection/test_smart_correlation_selection.py . Other minor changes are automatically fixed by black or isort.

Please review.

codecov[bot] commented 8 months ago

Codecov Report

Merging #703 (06bbf15) into main (a819eb1) will increase coverage by 0.00%. Report is 3 commits behind head on main. The diff coverage is 100.00%.

:exclamation: Current head 06bbf15 differs from pull request most recent head 44d3b16. Consider uploading reports for the commit 44d3b16 to get more accurate results

@@           Coverage Diff           @@
##             main     #703   +/-   ##
=======================================
  Coverage   98.03%   98.04%           
=======================================
  Files         100      100           
  Lines        3877     3878    +1     
  Branches      761      761           
=======================================
+ Hits         3801     3802    +1     
  Misses         28       28           
  Partials       48       48

Files	Coverage Δ
...re_engine/selection/smart_correlation_selection.py	`97.67% <100.00%> (+0.02%)`	:arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

solegalli commented 8 months ago

Thank you @pangjac !! That was by far the easiest solutions from all the ones we've contemplated.

Could you add it in this transformer as well?: https://github.com/feature-engine/feature_engine/blob/main/feature_engine/selection/drop_correlated_features.py

solegalli commented 8 months ago

Would it be possible to add a test that fails with the original implementation but passes with your changes?

pangjac commented 8 months ago

Hi @solegalli I don't see there is related issues on drop_correlated_features, can you specify which kind of issue related?

Could you add it in this transformer as well?: https://github.com/feature-engine/feature_engine/blob/main/feature_engine/selection/drop_correlated_features.py

Also,

add a test that fails with the original implementation but passes with your changes?

I noticed that all of the tests in CircleCI test_feature_engine_py3x failed. I assume that these CircleCI tests are running on the main branch? I added 4 sub-test methods to show that features_to_drop_ is not consistent in the main branch.

If you switch to the branch bug/smartcorrselect-inconsist and run pytest pytest -s tests/test_selection/test_smart_correlation_selection.py, you will see that all of the newly added sub-tests pass. In this regard, what is the best way to add new pytest tests and include them in this pull request (if CircleCI automatically fails on the main branch)?

solegalli commented 8 months ago

Hey @pangjac

The tests are failing because seaborn is not a dependency of feature-engine, therefore it is not installed in the test environment on circleci.

Could we use a hand-crafted dataset for the test instead?

There is no issue associated with the [drop_correlated_features class, but it has the same problem, because smart correlation is just an upgrade of this more simple transformer.

solegalli commented 5 months ago

This PR has been over-seeded by #721