feature-engine / feature_engine

Feature engineering package with sklearn-like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

Encoding ordinal variables #613

Open david-cortes opened 1 year ago

david-cortes commented 1 year ago

Oftentimes, one wants to build linear models having ordinal variables as features (e.g. "rate on a scale from 1 to 5 ..."). One might treat these as numerical or categorical, but either choice loses some information: treating them as numerical assumes equal spacing between levels, while treating them as categorical discards their order.

It would be nice to have ordinal versions of some typical categorical encoders, such as mean/frequency encoders, that do the grouping by a condition x <= c instead of x == c.

solegalli commented 1 year ago

Hi @david-cortes

Thanks for the suggestion!

I am not sure I understand what the output of the encoder should be. Could you give us an example?

david-cortes commented 1 year ago

For example, if there is a column taking possible values [1, 2, 3] and we want an ordinal mean encoding, the mapping would be:

1 -> mean(y[x <= 1])
2 -> mean(y[x <= 2])
3 -> mean(y[x <= 3])

i.e. a mean calculated by grouping over rows whose value in the column being encoded is <= a threshold (so the calculation for a value of 2 would also involve rows with a value of 1), instead of a mean calculated by grouping over each value separately.
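
A minimal pandas sketch of the mapping described above, on made-up toy data. The function name `ordinal_mean_mapping` is hypothetical and is not part of feature_engine; it only illustrates the x <= c grouping.

```python
import pandas as pd


def ordinal_mean_mapping(x: pd.Series, y: pd.Series) -> dict:
    """Map each distinct value c of x to mean(y[x <= c]).

    Hypothetical helper for illustration; not a feature_engine API.
    """
    # Per-level sum and count of the target, ordered by level.
    stats = y.groupby(x).agg(["sum", "count"]).sort_index()
    # Cumulative sums turn per-level stats into "x <= c" stats.
    cum = stats.cumsum()
    return (cum["sum"] / cum["count"]).to_dict()


# Toy data (assumed for the example).
x = pd.Series([1, 1, 2, 2, 3, 3])
y = pd.Series([0.0, 1.0, 1.0, 1.0, 0.0, 1.0])

mapping = ordinal_mean_mapping(x, y)  # one encoded value per level
encoded = x.map(mapping)              # encoded column
```

With this construction the highest level always maps to the overall mean of y, since x <= max(x) selects every row.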

solegalli commented 1 year ago

thank you

AnotherSamWilson commented 1 year ago

I second this. This type of encoding is especially useful for linear modeling. It has an averaging effect on ordinal variables that is much more stable than simple one-hot encoding.

@solegalli if I get a pull request together along with examples of how it is beneficial, is this something the team would consider merging?

solegalli commented 1 year ago

Hey @AnotherSamWilson

Thanks for joining this discussion.

Yes, we tend to be quite open towards new functionality.

I've never heard of or read about this type of encoding. Is there an article that you could link for more info? Or is this something that you do in practice, perhaps a common practice in some industry?

To make it meaningful for potential users, we would have to add, besides the functionality, a good user guide with examples of how to use this class, and explanations about what constitutes a good use case for this type of encoding. You seem to have it covered though, because you mention examples of how this would be beneficial. So go for it!

I look forward to the PR :)

kylegilde commented 1 year ago

Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?

david-cortes commented 1 year ago

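> Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?

No, because it would require overlapping groups: a row with value 1 must also count toward the means for 2 and 3, whereas a discretiser assigns each row to exactly one bin.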
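
To make the difference concrete, here is a small sketch that mimics the two approaches with plain pandas rather than calling feature_engine. The data are made up, and `pd.cut` followed by a grouped mean only stands in for a discretise-then-mean-encode pipeline under that assumption.

```python
import pandas as pd

# Toy data (assumed for the example).
x = pd.Series([1, 1, 2, 2, 3, 3])
y = pd.Series([0.0, 1.0, 1.0, 1.0, 0.0, 1.0])

# Disjoint bins (-inf, 1], (1, 2], (2, 3]: each row falls in exactly one bin,
# so each bin mean only sees the rows of that bin.
bins = pd.cut(x, bins=[float("-inf"), 1, 2, 3])
disjoint_means = y.groupby(bins, observed=True).mean().tolist()

# Overlapping groups x <= c: a row with x == 1 contributes to every threshold.
overlapping_means = [y[x <= c].mean() for c in sorted(x.unique())]
```

On this data the disjoint bin means are 0.5, 1.0 and 0.5, while the overlapping means are 0.5, 0.75 and roughly 0.667. No choice of non-overlapping bin edges lets one row count toward several encoded values.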