oxinabox opened this issue 7 years ago
For inspiration: caret in R has an option in its preprocessing step to remove "near zero variance" features. While in that case it is about features, I do think the way the parameters are specified is interesting:
> `freqCut`: the cutoff for the ratio of the most common value to the second most common value
Bottom line: the cutoff is specified relative to the most common value, using what is referred to as a frequency ratio. I think that could make sense for us here as well. For example, `cutoff = 95/5` would say that for every 95 observations with the most common label, every other label must have at least 5 observations in order not to be excluded.
Alternate API idea based on your example, for brainstorming (as a side note: I'd prefer function names simple enough to not require underscores):
```julia
origlbls = [:A, :A, :A, :B, :B, :B, :C, :D, :A]
whitelist, blacklist = rarelabel(origlbls, ratio=2/1)
@assert whitelist == [:A, :B]
@assert blacklist == [:C, :D]

LabelEnc.ManyVsOther(whitelist, blacklist, other=:X)
LabelEnc.ManyVsOther(rarelabel(origlbls)..., other=:X)
```
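A minimal sketch of what `rarelabel` could look like under the frequency-ratio semantics described above (everything here is hypothetical API, not an existing implementation; since `Dict` iteration order is undefined, the returned lists are compared as sets):

```julia
using StatsBase: countmap

# Hypothetical implementation sketch. `ratio = 2/1` means a label is kept
# (whitelisted) if it occurs at least half as often as the most common
# label; the default 95/5 matches the caret-style cutoff discussed above.
function rarelabel(lbls; ratio = 95/5)
    counts = countmap(lbls)
    maxcount = maximum(values(counts))
    whitelist = [l for (l, c) in counts if c >= maxcount / ratio]
    blacklist = [l for (l, c) in counts if c < maxcount / ratio]
    return whitelist, blacklist
end

origlbls = [:A, :A, :A, :B, :B, :B, :C, :D, :A]
whitelist, blacklist = rarelabel(origlbls, ratio = 2/1)
@assert Set(whitelist) == Set([:A, :B])
@assert Set(blacklist) == Set([:C, :D])
```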
Other name ideas for `rarelabel`: `freqlabel`, `partitionlabel`.
Hi, thanks for creating this, the JuliaML ecosystem seems really cool! I'll just mention how other people have dealt with this in two quite popular R packages.
There is the `fct_lump` function from the `forcats` R package. It controls how different levels of a factor are lumped together via two parameters, `n` and `prop`:
> If both `n` and `prop` are missing, `fct_lump` lumps together the least frequent levels into "other", while ensuring that "other" is still the smallest level. It's particularly useful in conjunction with `fct_inorder()`.
>
> Positive `n` preserves the most common `n` values. Negative `n` preserves the least common `-n` values. If there are ties, you will get at least `abs(n)` values.
>
> Positive `prop` preserves values that appear at least `prop` of the time. Negative `prop` preserves values that appear at most `-prop` of the time.
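For brainstorming, here is a rough Julia analogue of the `n` / `prop` semantics above (this is not the forcats API; the name `lump` and everything else is hypothetical, and negative `n` / `prop`, ties, and the no-argument default are omitted for brevity):

```julia
using StatsBase: countmap

# Hypothetical sketch, not the forcats API.
function lump(lbls; n = nothing, prop = nothing, other = :other)
    counts = countmap(lbls)
    keep = if n !== nothing
        # keep the `n` most common levels
        sorted = sort!(collect(counts); by = last, rev = true)
        Set(first.(sorted[1:min(n, length(sorted))]))
    elseif prop !== nothing
        # keep levels that account for at least `prop` of the observations
        Set(l for (l, c) in counts if c / length(lbls) >= prop)
    else
        error("specify n or prop")
    end
    return [l in keep ? l : other for l in lbls]
end

lump([:A, :A, :A, :B, :B, :C]; n = 1)      # [:A, :A, :A, :other, :other, :other]
lump([:A, :A, :A, :B, :B, :C]; prop = 0.3) # [:A, :A, :A, :B, :B, :other]
```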
I can also imagine that people would want to combine labels based on some measure with respect to the dependent variable (treating each label as a dummy variable). E.g., the R package vtreat deals with rare labels in its `designTreatmentsC` function:
> ...
> - `minFraction`: optional minimum frequency a categorical level must have to be converted to an indicator column.
> - `rareCount`: optional integer; levels with this count or below are allowed to be pooled into a shared rare level. Defaults to 0 (off).
> - `rareSig`: optional numeric; suppress levels from pooling when their significance value is greater than this. Defaults to NULL (off).
> ...
The `rareSig` threshold is applied to p-values from functions like:

```r
function(level) {
  # regress the numeric outcome on an indicator for this level
  m <- stats::lm(yNumeric ~ (vcol == level), weights = weights)
  # p-value for the level's effect from the ANOVA table
  stats::anova(m)[1, "Pr(>F)"]
}
```
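A Julia sketch of the same rareSig-style filter (hypothetical names throughout; the lm F-test on an indicator is equivalent to a pooled-variance two-sample t-test, which is what is used here):

```julia
using HypothesisTests: EqualVarianceTTest, pvalue

# Sketch only: a level escapes pooling if its indicator is significantly
# associated with the numeric outcome, mirroring the R snippet above.
function significant_levels(lbls, y; sig = 0.05)
    [l for l in unique(lbls)
       if pvalue(EqualVarianceTTest(y[lbls .== l], y[lbls .!= l])) <= sig]
end
```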
That is valuable information. Thanks for taking the time to summarize this!
As this is still open, I'll throw in my two cents :). Encoding rare categories is definitely useful for exploratory analysis and visualization, but I'd say it is suboptimal for actual modeling, as we're throwing away some information. In the best case, you can use some Bayesian magic in your model and smooth away / shrink the high variance of the effects of rare categories in a supervised manner. To this end it sometimes suffices to replace the levels of a categorical variable with their corresponding random effects from a very simple mixed model (and Julia's MixedModels.jl can handle really large categorical vectors, I think). This way you encode your categorical variable into a numerical one, you have a sensible way of treating unobserved levels in the future, and it won't overfit rare levels as much as simple mean / weight-of-evidence coding does.

The other popular option is to use shallow neural nets to embed the levels, but I think that would require some smart regularization to handle the rare ones sensibly.
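To make the first idea concrete, here is a minimal sketch of the shrinkage behaviour, not using MixedModels.jl itself but hand-rolled empirical-Bayes-style smoothing; the function name `shrunken_encoding` and the smoothing parameter `k` are made up for illustration:

```julia
using Statistics: mean

# Sketch only: encode each level by its mean outcome, shrunk toward the
# grand mean in proportion to how few observations the level has. The
# (hypothetical) smoothing parameter `k` plays the role the variance
# ratio of a random effect would play in a real mixed model.
function shrunken_encoding(lbls, y; k = 10)
    grand = mean(y)
    enc = Dict{eltype(lbls), Float64}()
    for l in unique(lbls)
        idx = findall(==(l), lbls)
        n = length(idx)
        # rare levels (small n) end up close to the grand mean
        enc[l] = (n * mean(y[idx]) + k * grand) / (n + k)
    end
    return enc
end

# :B occurs only once, so its encoding is pulled toward the grand mean
# much more strongly than :A's.
shrunken_encoding([:A, :A, :B], [1.0, 3.0, 10.0]; k = 1)
```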
Both of these ideas are nicely implemented in the R package embed (tidymodels/embed on GitHub). I really like the recent work in tidymodels, but these approaches might be a better fit for some "PreprocessingPipelines.jl" package.
In cases with a highly unbalanced class distribution, some classes occur so rarely in the training data that it is better to ignore them. In these cases, it would be useful to be able to set a threshold and have all labels that occur less often than this mapped to a single label.

Here I will use symbols, for clarity of example.

The record of which input labels are not rare should be stored in the encoding, so the mapping can be repeated. Possibly it should also have a parameter to determine whether never-before-seen labels in the test data are an error, or just another rare label. (Possibly not, though.)

This encoding would be chained with other encodings.

It would also go well with a filtering method in MLDataUtils, for when it is permissible to exclude these rare labels from training entirely.
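A minimal sketch of the proposed encoding (all names here are hypothetical): labels occurring fewer than `mincount` times in the training data are mapped to a single rare label, and the set of kept labels is stored so the mapping can be reapplied to test data:

```julia
using StatsBase: countmap

# Sketch of the proposed encoding, with made-up names. The set of kept
# (non-rare) labels is stored so the same mapping can be repeated later.
struct RareLabelEncoding{T}
    kept::Set{T}
    rare::T
end

function RareLabelEncoding(trainlbls; mincount = 2, rare = :rare)
    counts = countmap(trainlbls)
    kept = Set(l for (l, c) in counts if c >= mincount)
    return RareLabelEncoding(kept, rare)
end

# Never-before-seen labels in test data are treated here as just another
# rare label; erroring instead would be the stricter alternative.
encode(enc::RareLabelEncoding, lbls) = [l in enc.kept ? l : enc.rare for l in lbls]

enc = RareLabelEncoding([:A, :A, :B, :B, :C], mincount = 2)
encode(enc, [:A, :C, :D])  # [:A, :rare, :rare]
```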