feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

return name of variables with 0s in denominator in woe encoder #676

Closed solegalli closed 1 year ago

solegalli commented 1 year ago

closes #672 closes #430 closes #56

and also closes #673

codecov[bot] commented 1 year ago

Codecov Report

Merging #676 (a9cefc1) into main (e73772d) will increase coverage by 0.07%. The diff coverage is 99.28%.

@@            Coverage Diff             @@
##             main     #676      +/-   ##
==========================================
+ Coverage   97.91%   97.99%   +0.07%     
==========================================
  Files         100      100              
  Lines        3748     3841      +93     
  Branches      726      754      +28     
==========================================
+ Hits         3670     3764      +94     
+ Misses         29       28       -1     
  Partials       49       49              
| Impacted Files | Coverage Δ |
|---|---|
| feature_engine/datetime/datetime_subtraction.py | 94.66% <ø> (ø) |
| feature_engine/tags.py | 100.00% <ø> (ø) |
| feature_engine/selection/drop_psi_features.py | 99.40% <98.55%> (+1.08%) |
| feature_engine/creation/relative_features.py | 100.00% <100.00%> (ø) |
| feature_engine/encoding/rare_label.py | 100.00% <100.00%> (ø) |
| feature_engine/encoding/woe.py | 100.00% <100.00%> (ø) |
| feature_engine/imputation/drop_missing_data.py | 100.00% <100.00%> (ø) |
| feature_engine/selection/shuffle_features.py | 100.00% <100.00%> (ø) |
| feature_engine/transformation/reciprocal.py | 100.00% <100.00%> (ø) |
| ..._engine/variable_handling/_variable_type_checks.py | 92.30% <100.00%> (+1.00%) |


solegalli commented 1 year ago

Hey @glevv

I am expanding the functionality of the WoE encoder to do 2 things:

  1. return in the error message the name of all variables that had categories with zero in the numerator or denominator of the WoE calculation (that happens when the sum of target=0 or target=1 is zero for that category)
  2. when the denominator or numerator is zero, replace it with an arbitrary value and proceed with the calculation
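The two points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `class_shares`, `fit_woe`, the `fill_value` parameter name and the error wording are hypothetical, and a binary target coded as 0/1 is assumed.

```python
import numpy as np
import pandas as pd


def class_shares(X, y, variable):
    """Share of target=1 and target=0 observations that fall in each category."""
    # Rows: categories of `variable`; columns: target values 0 and 1.
    counts = pd.crosstab(X[variable], y).reindex(columns=[0, 1], fill_value=0)
    return pd.DataFrame(
        {"pos": counts[1] / counts[1].sum(), "neg": counts[0] / counts[0].sum()}
    )


def fit_woe(X, y, variables, fill_value=None):
    """Hypothetical helper: learn WoE mappings, naming every problematic variable."""
    woe_maps, failed = {}, []
    for var in variables:
        shares = class_shares(X, y, var)
        zeros = shares == 0
        if zeros.to_numpy().any():
            if fill_value is None:
                # Point 1: remember the variable instead of failing on the first one.
                failed.append(var)
                continue
            # Point 2: replace only the zero entries with the fill value.
            shares = shares.mask(zeros, fill_value)
        woe_maps[var] = np.log(shares["pos"] / shares["neg"]).to_dict()
    if failed:
        raise ValueError(
            "WoE is undefined because some categories have zero in the numerator "
            "or denominator. Variables: " + ", ".join(failed)
        )
    return woe_maps


# Toy data: category "c" of "color" has no target=0 observations.
X = pd.DataFrame({"color": list("aabbcc"), "size": list("ssllsl")})
y = pd.Series([0, 1, 0, 1, 1, 1])

try:
    fit_woe(X, y, ["color", "size"])
except ValueError as err:
    print(err)  # the message names "color" only

print(fit_woe(X, y, ["color", "size"], fill_value=0.0001))
```

Collecting the names before raising means the user sees every affected variable in one pass, rather than fixing them one error at a time.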

Regarding 1: the error was there already. The improvement consists of letting users know which variables are problematic.

Regarding 2: I think this defeats the point of the calculation, but I guess we leave this to the user. The implementation differs from Category Encoders in two ways: 1) Category Encoders adds the regularization to all categories, regardless of whether they have zeros or not, whereas here we modify only those with 0; and 2) we replace the 0 with the fill value, whereas Category Encoders adds the regularization to both the denominator and the numerator.
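To make the difference concrete, here is a small sketch of the two strategies on per-category counts. The function names are hypothetical, and the additive smoothing shown is one simple form assumed for illustration; it is not necessarily the exact formula used by Category Encoders.

```python
import numpy as np


def woe_fill_zeros(pos_counts, neg_counts, fill_value):
    """Replace only the zero shares with fill_value; other categories are untouched."""
    pos = np.asarray(pos_counts, dtype=float)
    neg = np.asarray(neg_counts, dtype=float)
    pos_share = pos / pos.sum()
    neg_share = neg / neg.sum()
    pos_share = np.where(pos_share == 0, fill_value, pos_share)
    neg_share = np.where(neg_share == 0, fill_value, neg_share)
    return np.log(pos_share / neg_share)


def woe_regularize_all(pos_counts, neg_counts, regularization):
    """Add the same constant to the numerator and denominator of every category."""
    pos = np.asarray(pos_counts, dtype=float)
    neg = np.asarray(neg_counts, dtype=float)
    # Assumed smoothing: shift every count, and the totals accordingly.
    pos_share = (pos + regularization) / (pos.sum() + 2 * regularization)
    neg_share = (neg + regularization) / (neg.sum() + 2 * regularization)
    return np.log(pos_share / neg_share)


# Three categories; the last one has no target=0 observations.
pos_counts = [1, 1, 2]
neg_counts = [1, 1, 0]

filled = woe_fill_zeros(pos_counts, neg_counts, fill_value=0.0001)
smoothed = woe_regularize_all(pos_counts, neg_counts, regularization=1.0)
print(filled)
print(smoothed)
```

With the fill approach, the first two categories keep exactly the WoE they would have had without any adjustment; with smoothing applied everywhere, every category's value shifts.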

I would appreciate it if you had 2 minutes to let me know what you think about the implementation, and if you can go over the code, even better.

Thanks a lot!