feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

ResidualEncoder for categorical variables #642

Closed: Morgan-Sell closed 5 months ago

Morgan-Sell commented 1 year ago

Is your feature request related to a problem? Please describe.

On page 168 of "Practical Statistics for Data Scientists", the authors discuss grouping categorical variables using the residuals from a regression.

The code can be found here starting on line 230.

Describe the solution you'd like

Proposed steps:

  1. Select categorical features to be encoded
  2. Select predictor features to use
  3. Select a regression function
  4. Calculate the residuals for all observations
  5. Derive the median residual for each unique value (category) of each categorical variable
  6. Group/discretize the categories by their median residual using pd.cut.
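The steps above can be sketched roughly as follows. This is only an illustration, not the book's code or an existing feature-engine API: the helper name `residual_encode`, its parameters, and the synthetic data are all made up, and an ordinary least squares fit via NumPy stands in for whichever regression function is selected in step 3.

```python
import numpy as np
import pandas as pd


def residual_encode(df, cat_col, predictors, target, n_bins=3):
    """Hypothetical helper: map each category of `cat_col` to a residual-based bin."""
    # Steps 2-3: fit ordinary least squares on the numeric predictors
    # (a stand-in for whichever regression function is chosen).
    X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
    y = df[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Step 4: residual = observed target - predicted target, per observation.
    residuals = pd.Series(y - X @ coef, index=df.index)

    # Step 5: median residual for each unique category.
    median_resid = residuals.groupby(df[cat_col]).median()

    # Step 6: discretize the medians into equal-width bins with pd.cut.
    bins = pd.cut(median_resid, bins=n_bins, labels=False)
    return bins.astype(int).to_dict()


# Synthetic data (made up for illustration): price depends on area plus a
# zip-code effect that a regression on area alone cannot capture.
rng = np.random.default_rng(0)
zip_effect = {"A": -50.0, "B": 0.0, "C": 50.0}
zips = rng.choice(list(zip_effect), size=300)
area = rng.uniform(50, 200, size=300)
noise = rng.normal(0, 1, size=300)
price = 2.0 * area + pd.Series(zips).map(zip_effect).to_numpy() + noise
data = pd.DataFrame({"zip": zips, "area": area, "price": price})

mapping = residual_encode(data, "zip", ["area"], "price", n_bins=3)
data["zip_encoded"] = data["zip"].map(mapping)
```

The idea is that categories whose observations are systematically over- or under-predicted by the regression end up in different bins, so the encoded variable carries the signal the other predictors missed. Because the residuals depend on the target, the mapping would have to be learned on the training set only.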

Describe alternatives you've considered

n/a

Additional context

Will search for additional research.

solegalli commented 1 year ago

Hey @Morgan-Sell

Thanks for the suggestion.

The file you linked is massive. Which lines of code are the relevant ones, the ones that actually do the encoding?

That way I can understand what this is about.

What is the idea? Do you use numerical variables to predict the categories of the categorical ones? Get the residuals between what and what? I don't understand lol.

Does the book include a reference? Where is this coming from? What's the logic of this encoding? When is it suitable? I'd probably need to read the book.

Morgan-Sell commented 1 year ago

The code is from lines 230 to 264.

The example involves predicting housing prices in Seattle. In the example, the author encodes zip codes. I think the encoder is meant to be used with high-cardinality categorical variables.

I don't know how much the book will help. Unfortunately, its description of the transformation is brief. I thought it was a clever idea to transform a variable based on residuals. Residuals can carry significant information.

solegalli commented 1 year ago

Hey @Morgan-Sell

If the book you mention doesn't explain the encoding clearly and does not quote an additional reference, I am inclined to close this issue.

In short, it would be good to know how well accepted and how well grounded the encoding is before making it part of feature-engine. Based on your previous reply, it sounds like that is not super clear.

If you agree, then pls close it :)