WinVector / vtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under choice of GPL-2 or GPL-3 license.
https://winvector.github.io/vtreat/

Question: High-cardinality factor impact coding #2

Closed rsuhada closed 8 years ago

rsuhada commented 8 years ago

I'm trying to understand the definition of the catB impact code. In the basic version described on the blog, I understood the formula used there to be equivalent to the empirical-Bayes equation from the Micci-Barreca paper (eq. 3) with the weighting lambda(n_i) = n_i / (n_i + 1).
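To fix notation, this is roughly the estimator I have in mind (my own R sketch, not vtreat code; the function and column names are just placeholders, and the weight lambda(n) = n / (n + 1) is the one I'm asking about):

```r
# Sketch of Micci-Barreca eq. 3 with lambda(n) = n / (n + 1); not vtreat's code.
impact_eq3 <- function(x, y) {
  priorP <- mean(y)                            # global rate P(y)
  levelEst <- tapply(y, x, function(v) {
    n <- length(v)
    lambda <- n / (n + 1)                      # the weight in question
    lambda * mean(v) + (1 - lambda) * priorP   # blend of level estimate and prior
  })
  levelEst[as.character(x)]                    # per-row encoded value
}

# Example:
# x <- c("a", "a", "b", "c"); y <- c(1, 0, 1, 1)
# impact_eq3(x, y)
```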

What is the motivation for this weight? The formula seems familiar, and it seems better than the weights listed in the paper itself. I haven't managed to derive it, though.

The second question is: what is the relation of this original weight to the current implementation? From the code I see that a smoothing factor (e.g. probT*smFactor) is now added (and the result is logged). Is there a reference for this approach where I could learn more?

Thank you very much!

JohnMount commented 8 years ago

vtreat uses P(y|x=level) without the lambda(n) weight (though we do use a pseudo-count smoothing factor through the smFactor argument). In both cases you are allowing in some bias to try to lower the variance of the conditional estimates (variance under possible re-samplings of the training data), in the hope of lowering overall expected error. The Micci-Barreca lambda isn't a pseudo-count; it is more like a shrinkage in James–Stein estimator terms (eq. 3 is a bit better than plain "shrinkage" since lambda(n) is not constant). Misha Bilenko's "learning with counts" gets around this by exposing all the pieces (numerator, denominator) to the downstream machine learning so it can work with them directly.
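Roughly, the pseudo-count idea looks like the following (an illustrative R sketch, not the actual vtreat code; the function name and the smFactor default are just for the example):

```r
# Sketch of pseudo-count smoothing toward the grand rate; not vtreat's actual code.
smoothed_conditional <- function(x, y, smFactor = 0.5) {
  priorP <- mean(y)                                   # grand rate P(y)
  tapply(y, x, function(v) {
    # act as if we had seen smFactor extra "pseudo-observations" at the grand rate
    (sum(v) + smFactor * priorP) / (length(v) + smFactor)
  })
}
```

The point is that smFactor is measured in observations, so its effect is large for rare levels and fades as the level count grows.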

We prefer the pseudo-count smFactor as it "feels a bit more Bayesian" and is in "observation units."

The logging is simply because a lot of the downstream modeling assumes or imposes an additive or linear structure, so working in log units is tempting. It is a good idea if the next stage is linear, probably not a great idea if the next stage is logistic regression (but it doesn't seem to hurt too much), and it doesn't matter much at all if the next stage is monotone (tree-based methods such as decision trees, random forests, or gradient boosting).
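In sketch form, the log-units version is something like the difference in log-odds between the smoothed conditional rate and the grand rate (again illustrative R, not a transcription of the catB code):

```r
# Sketch: express the smoothed conditional estimate in log-odds ("impact") units.
# Illustrates the logging idea; not a transcription of vtreat's catB implementation.
logit <- function(p) log(p / (1 - p))

impact_log_odds <- function(x, y, smFactor = 0.5) {
  priorP <- mean(y)
  condP <- tapply(y, x, function(v) {
    # pseudo-count smoothing keeps the estimate strictly between 0 and 1
    (sum(v) + smFactor * priorP) / (length(v) + smFactor)
  })
  logit(condP) - logit(priorP)   # additive per-level term for a linear next stage
}
```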

Don't really have many more "on topic" references. The ideas are collected from experience and taste and may not be the only plausible design. We tend to use "The Elements of Statistical Learning, 2nd edition" (Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman) and "Bayesian Data Analysis, 3rd edition" (Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin) as our machine learning references.