feature-engine / feature_engine

Feature engineering package with sklearn-like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License
1.8k stars, 304 forks

Raise informative error on duplicated column names #686

Closed: david-cortes closed this 9 months ago

david-cortes commented 11 months ago

This PR adds an informative error message for inputs with duplicated column names. Without it, duplicated names surface later as hard-to-track errors (e.g. https://github.com/feature-engine/feature_engine/pull/681).
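To illustrate why such errors are hard to track, here is a minimal sketch using plain pandas (the toy data frame is made up for illustration and is not taken from the linked PR). pandas accepts duplicated column names at construction time, and selecting a duplicated label silently returns a DataFrame instead of a Series, so the failure tends to show up far from its cause.

```python
import pandas as pd

# Toy frame with the column name "a" repeated (illustrative data only).
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# No error is raised here: selecting a duplicated label returns
# every matching column, so the result is a DataFrame, not a Series.
selected = df["a"]
selected_type = type(selected).__name__
selected_shape = selected.shape

print(selected_type)   # DataFrame
print(selected_shape)  # (2, 2)
```

Downstream code that expects a Series from `df["a"]` can then fail with a message that never mentions duplicated names.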

david-cortes commented 11 months ago

Moved the check to check_X instead as suggested above.

Regarding errors from pandas, it does throw errors sometimes when there are duplicates, but only under some particular situations: https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html

I also switched the check to use the is_unique attribute, since that is what the pandas guide recommends.
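As a rough sketch of what a check based on is_unique could look like (this is not the code merged in this PR; the function name `check_unique_columns` and the error wording are assumptions made for illustration):

```python
import pandas as pd


def check_unique_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: raise if X has duplicated column names."""
    if not X.columns.is_unique:
        # Collect each duplicated name once so the message stays short.
        duplicated = X.columns[X.columns.duplicated()].unique().tolist()
        raise ValueError(
            f"Input data contains duplicated variable names: {duplicated}"
        )
    return X


df_ok = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df_dup = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

check_unique_columns(df_ok)  # passes and returns the frame unchanged

try:
    check_unique_columns(df_dup)
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early in a shared input check means every transformer reports the same, explicit message instead of whatever pandas happens to raise later.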

david-cortes commented 11 months ago

Moved the tests to test_dataframe_checks.py.

david-cortes commented 11 months ago

Added a check on the error message.
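A test that also checks the error message might look roughly like the sketch below. The stand-in function `check_unique_columns`, the test name, and the message text are assumptions for illustration; the actual tests live in test_dataframe_checks.py and may differ.

```python
import pandas as pd
import pytest


def check_unique_columns(X: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for the library's input check.
    if not X.columns.is_unique:
        raise ValueError("Input data contains duplicated variable names.")
    return X


def test_raises_error_on_duplicated_column_names():
    df = pd.DataFrame([[1, 2, 3]], columns=["col1", "col2", "col1"])
    # match= is applied as a regex search against the error message.
    with pytest.raises(ValueError, match="duplicated variable names"):
        check_unique_columns(df)


def test_passes_with_unique_column_names():
    df = pd.DataFrame({"col1": [1], "col2": [2]})
    assert check_unique_columns(df) is df
```

Asserting on the message, not only on the exception type, guards against the check being replaced by an unrelated ValueError raised elsewhere.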

solegalli commented 9 months ago

Hey @david-cortes

I made a PR to your repo: https://github.com/david-cortes/feature_engine/pull/4

There I rebased onto main and added this contribution to the changelog.

Would you have time to merge over there, so it updates here and I can merge and close?

Thanks a lot!

david-cortes commented 9 months ago

Thanks, although I think you should also be able to push changes to the branch directly.

codecov[bot] commented 9 months ago

Codecov Report

Merging #686 (a53a7bd) into main (3343305) will increase coverage by 0.00%. Report is 1 commit behind head on main. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #686   +/-   ##
=======================================
  Coverage   97.99%   97.99%           
=======================================
  Files         100      100           
  Lines        3843     3849    +6     
  Branches      754      752    -2     
=======================================
+ Hits         3766     3772    +6     
  Misses         28       28           
  Partials       49       49           
Files Changed                                      Coverage Δ
feature_engine/creation/math_features.py           97.77% <ø> (ø)
feature_engine/dataframe_checks.py                 97.05% <100.00%> (+0.08%)
feature_engine/datetime/datetime.py                100.00% <100.00%> (ø)
feature_engine/datetime/datetime_subtraction.py    94.73% <100.00%> (+0.07%)
feature_engine/encoding/base_encoder.py            100.00% <100.00%> (ø)
feature_engine/encoding/one_hot.py                 100.00% <100.00%> (ø)
feature_engine/encoding/rare_label.py              100.00% <100.00%> (ø)
feature_engine/imputation/categorical.py           95.31% <100.00%> (ø)
feature_engine/selection/shuffle_features.py       100.00% <100.00%> (ø)
feature_engine/transformation/yeojohnson.py        100.00% <100.00%> (ø)
... and 1 more
