oxinabox opened this issue 7 years ago
For inspiration: caret in R has an option in its preprocessing step to remove "near zero variance" features. While in that case it is about features, I do think the way the parameters are specified is interesting:
> `freqCut`: the cutoff for the ratio of the most common value to the second most common value
Bottom line: the cutoff is specified relative to the most common value, using what is referred to as a frequency ratio. I think that could make sense for us here as well. For example, `cutoff = 95/5` would say that for every 95 observations with the most common label, every other label must have at least 5 observations in order not to be excluded.
Alternate API idea based on your example, for brainstorming (as a side note: I'd prefer function names simple enough to not require underscores):
```julia
origlbls = [:A, :A, :A, :B, :B, :B, :C, :D, :A]
whitelist, blacklist = rarelabel(origlbls, ratio=2/1)
@assert whitelist == [:A, :B]
@assert blacklist == [:C, :D]

LabelEnc.ManyVsOther(whitelist, blacklist, other=:X)
LabelEnc.ManyVsOther(rarelabel(origlbls)..., other=:X)
```
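A minimal sketch of what `rarelabel` could look like under the frequency-ratio semantics described above (everything here is hypothetical API, not an existing implementation; since `Dict` iteration order is undefined, the returned lists are compared as sets):

```julia
using StatsBase: countmap

# Hypothetical implementation sketch. `ratio = 2/1` means a label is kept
# (whitelisted) if it occurs at least half as often as the most common
# label; the default 95/5 matches the caret-style cutoff discussed above.
function rarelabel(lbls; ratio = 95/5)
    counts = countmap(lbls)
    maxcount = maximum(values(counts))
    whitelist = [l for (l, c) in counts if c >= maxcount / ratio]
    blacklist = [l for (l, c) in counts if c < maxcount / ratio]
    return whitelist, blacklist
end

origlbls = [:A, :A, :A, :B, :B, :B, :C, :D, :A]
whitelist, blacklist = rarelabel(origlbls, ratio = 2/1)
@assert Set(whitelist) == Set([:A, :B])
@assert Set(blacklist) == Set([:C, :D])
```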
Other name ideas for `rarelabel`: `freqlabel`, `partitionlabel`.
Hi, thanks for creating this, the JuliaML ecosystem seems really cool! I'll just mention how other people have dealt with this in two quite popular R packages.
There is the `fct_lump` function from the `forcats` R package. It controls how different levels of a factor are lumped together via two parameters, `n` and `prop`:
> If both `n` and `prop` are missing, `fct_lump` lumps together the least frequent levels into "other", while ensuring that "other" is still the smallest level. It's particularly useful in conjunction with `fct_inorder()`.
>
> Positive `n` preserves the most common `n` values. Negative `n` preserves the least common `-n` values. If there are ties, you will get at least `abs(n)` values.
>
> Positive `prop` preserves values that appear at least `prop` of the time. Negative `prop` preserves values that appear at most `-prop` of the time.
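For brainstorming, here is a rough Julia analogue of the `n` / `prop` semantics above (this is not the forcats API; the name `lump` and everything else is hypothetical, and negative `n` / `prop`, ties, and the no-argument default are omitted for brevity):

```julia
using StatsBase: countmap

# Hypothetical sketch, not the forcats API.
function lump(lbls; n = nothing, prop = nothing, other = :other)
    counts = countmap(lbls)
    keep = if n !== nothing
        # keep the `n` most common levels
        sorted = sort!(collect(counts); by = last, rev = true)
        Set(first.(sorted[1:min(n, length(sorted))]))
    elseif prop !== nothing
        # keep levels that account for at least `prop` of the observations
        Set(l for (l, c) in counts if c / length(lbls) >= prop)
    else
        error("specify n or prop")
    end
    return [l in keep ? l : other for l in lbls]
end

lump([:A, :A, :A, :B, :B, :C]; n = 1)      # [:A, :A, :A, :other, :other, :other]
lump([:A, :A, :A, :B, :B, :C]; prop = 0.3) # [:A, :A, :A, :B, :B, :other]
```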
I can also imagine that people would want to combine labels based on some measure with respect to the dependent variable (treating each label as a dummy variable). E.g., the R package vtreat deals with rare labels in its `designTreatmentsC` function:
> ...
> - `minFraction`: optional minimum frequency a categorical level must have to be converted to an indicator column.
> - `rareCount`: optional integer; levels with this count or below are allowed to be pooled into a shared rare level. Defaults to 0 (off).
> - `rareSig`: optional numeric; suppress levels from pooling when their significance value is greater than this. Defaults to NULL (off).
> ...
The `rareSig` threshold is applied to p-values from functions like:

```r
function(level) {
  # regress the numeric outcome on an indicator for this level
  m <- stats::lm(yNumeric ~ (vcol == level), weights = weights)
  # p-value for the level's effect from the ANOVA table
  stats::anova(m)[1, "Pr(>F)"]
}
```
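A Julia sketch of the same rareSig-style filter (hypothetical names throughout; the lm F-test on an indicator is equivalent to a pooled-variance two-sample t-test, which is what is used here):

```julia
using HypothesisTests: EqualVarianceTTest, pvalue

# Sketch only: a level escapes pooling if its indicator is significantly
# associated with the numeric outcome, mirroring the R snippet above.
function significant_levels(lbls, y; sig = 0.05)
    [l for l in unique(lbls)
       if pvalue(EqualVarianceTTest(y[lbls .== l], y[lbls .!= l])) <= sig]
end
```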
That is valuable information. Thanks for taking the time to summarize this!
As this is still open, I'll throw in my two cents :). Encoding rare categories is definitely useful for exploratory analysis and visualization, but I'd say it is suboptimal for actual modeling, as we're throwing away some information. In the best case, you can use some Bayesian magic in your model and smooth away / shrink the high variance of the effects of rare categories in a supervised manner. To this end it sometimes suffices to replace the levels of a categorical variable with their corresponding random effects from a very simple mixed model (and Julia's MixedModels.jl can handle really large categorical vectors, I think). This way you encode your categorical variable into a numerical one, you have a sensible way of treating unobserved levels in the future, and it won't overfit rare levels as much as simple mean / weight-of-evidence coding does.

The other popular option is to use shallow neural nets to embed the levels, but I think that would require some smart regularization to handle the rare ones sensibly.
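To make the first idea concrete, here is a minimal sketch of the shrinkage behaviour, not using MixedModels.jl itself but hand-rolled empirical-Bayes-style smoothing; the function name `shrunken_encoding` and the smoothing parameter `k` are made up for illustration:

```julia
using Statistics: mean

# Sketch only: encode each level by its mean outcome, shrunk toward the
# grand mean in proportion to how few observations the level has. The
# (hypothetical) smoothing parameter `k` plays the role the variance
# ratio of a random effect would play in a real mixed model.
function shrunken_encoding(lbls, y; k = 10)
    grand = mean(y)
    enc = Dict{eltype(lbls), Float64}()
    for l in unique(lbls)
        idx = findall(==(l), lbls)
        n = length(idx)
        # rare levels (small n) end up close to the grand mean
        enc[l] = (n * mean(y[idx]) + k * grand) / (n + k)
    end
    return enc
end

# :B occurs only once, so its encoding is pulled toward the grand mean
# much more strongly than :A's.
shrunken_encoding([:A, :A, :B], [1.0, 3.0, 10.0]; k = 1)
```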
Both of these ideas are nicely implemented in the R package embed (tidymodels/embed on GitHub). I really like the recent work in tidymodels, but these approaches might be a better fit for some "PreprocessingPipelines.jl" package.
In cases with a highly unbalanced class distribution, some classes occur so rarely in the training data that it is better to ignore them. In these cases, it would be useful to be able to set a threshold and have all labels that occur less often than this mapped to a single label.

Here I will use symbols, for clarity of example.

The record of which input labels are not rare should be stored in the encoding, so the mapping can be repeated. Possibly it should also have a parameter to determine whether never-before-seen labels in the test data are an error, or just another rare label. (Possibly not, though.)

This encoding would be chained with other encodings.

It would also go well with a filtering method in MLDataUtils, for when it is permissible to exclude these rare labels from training entirely.
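A minimal sketch of the proposed encoding (all names here are hypothetical): labels occurring fewer than `mincount` times in the training data are mapped to a single rare label, and the set of kept labels is stored so the mapping can be reapplied to test data:

```julia
using StatsBase: countmap

# Sketch of the proposed encoding, with made-up names. The set of kept
# (non-rare) labels is stored so the same mapping can be repeated later.
struct RareLabelEncoding{T}
    kept::Set{T}
    rare::T
end

function RareLabelEncoding(trainlbls; mincount = 2, rare = :rare)
    counts = countmap(trainlbls)
    kept = Set(l for (l, c) in counts if c >= mincount)
    return RareLabelEncoding(kept, rare)
end

# Never-before-seen labels in test data are treated here as just another
# rare label; erroring instead would be the stricter alternative.
encode(enc::RareLabelEncoding, lbls) = [l in enc.kept ? l : enc.rare for l in lbls]

enc = RareLabelEncoding([:A, :A, :B, :B, :C], mincount = 2)
encode(enc, [:A, :C, :D])  # [:A, :rare, :rare]
```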