Open david-cortes opened 1 year ago
Hi @david-cortes
Thanks for the suggestion!
I am not sure I understand what the output of the encoder should be. Could you give us an example?
For example, if there is a column taking possible values [1, 2, 3]
and we want an ordinal mean encoding, the mapping would be:
1 -> mean(y[x <= 1])
2 -> mean(y[x <= 2])
3 -> mean(y[x <= 3])
i.e. a mean calculated by grouping over rows that have a value <= than a threshold in the column being encoded (so the calculation for a value of 2 would also involve rows with a value of 1), instead of a mean calculated by grouping over each value separately.
thank you
I second this. This type of encoding is very useful for linear modeling especially. It has an averaging effect on ordinal variables that is much more stable than simple one-hot encoding.
@solegalli if I get a pull request together along with examples of how it is beneficial, is this something the team would consider merging?
Hey @AnotherSamWilson
Thanks for joining this discussion.
Yes, we tend to be quite open towards new functionality.
I've never heard of / read about this type of encoding. Is there an article that you could link for more info? Or is this something that you guys do practically? common practice in some industry?
To make it meaningful for potential users, we would have to add, besides the functionality, a good user guide with examples of how to use this class, and explanations about what constitutes a good use case for this type of encoding. You seem to have it covered though, because you mention examples of how this would be beneficial. So go for it!
I look forward to the PR :)
Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?
Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?
No, because it'd require overlaps between rows.
Oftentimes, one wants to build linear models having ordinal variables as features (e.g. "rate in a scale from 1 to 5 ..."). One might treat these as numerical or categorical, but this loses some information.
Would be nice to have ordinal versions of some typical categorical encoders, such as mean/frequency encoders that would do the grouping by a condition
x<=c
instead ofx==c
.