WinVector / pyvtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.
https://winvector.github.io/pyvtreat/

documentation for score_frame_ ? #13

Closed mtchem closed 4 years ago

mtchem commented 4 years ago

First, thank you so much for vtreat; it has definitely changed how I approach pre-processing data. I am trying to understand the different columns created by the method score_frame_ for a BinomialOutcomeTreatment. I've looked through the Python examples, the Python API code, and the original paper, but I can't seem to find any information on 'has_range' and 'vcount'. What are the definitions of those columns, and/or where can I find more documentation on score_frame_?

JohnMount commented 4 years ago

Thanks for the note and sorry for the gap in documentation. I'll leave this open until we update the docs.

As for the details:

has_range means the variable moves (isn't a constant).
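As a rough illustration of "moves" versus "constant", here is a minimal sketch in plain Python. It does not call vtreat, and the helper name has_range and the sample columns are made up for illustration; vtreat's own check may differ in detail (for example in how it treats missing values).

```python
def has_range(values):
    """Return True when the column takes more than one distinct non-missing value."""
    distinct = {v for v in values if v is not None}
    return len(distinct) > 1


# Hypothetical prepared columns: one that varies, one that is constant.
columns = {
    "x_moving": [0.1, 0.5, 0.9],
    "x_constant": [1.0, 1.0, 1.0],
}

range_flags = {name: has_range(values) for name, values in columns.items()}
print(range_flags)  # {'x_moving': True, 'x_constant': False}
```

A constant column cannot help a model tell outcomes apart, which is why it is flagged.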

vcount is how many different variables were produced for a given treatment type. This lets us set a different significance threshold for each treatment type, so a data set with many indicator columns may still pass through interesting impact-coded columns. This is an improvement over our older scheme of using 1/number_of_variables as the significance threshold. The vcount column is included so that the recommended column can be reproducibly derived from the other columns in the score frame. default_threshold is set to 1/(vcount * num_treatment_types), where num_treatment_types is the number of different treatment types seen in the problem (typically about 5). If the significance estimate is below default_threshold, the variable is recommended. This scheme lets only an expected constant number of truly useless columns through the treatment.
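The arithmetic described above can be sketched in plain Python. This is an illustration under stated assumptions, not vtreat's implementation: the variable names, treatment labels, and significance values below are invented, and the score frame is modeled as a list of dicts rather than a data frame.

```python
from collections import Counter

# Hypothetical score-frame rows: (variable, treatment type, significance).
rows = [
    ("x_clean", "clean_copy", 0.001),
    ("x_is_bad", "missing_indicator", 0.40),
    ("c_logit_code", "logit_code", 0.20),
    ("c_lev_a", "indicator_code", 0.30),
    ("c_lev_b", "indicator_code", 0.04),
    ("c_lev_c", "indicator_code", 0.60),
    ("c_lev_d", "indicator_code", 0.90),
]

# vcount: number of variables produced per treatment type.
vcount = Counter(treatment for _, treatment, _ in rows)
num_treatment_types = len(vcount)

score_frame = []
for variable, treatment, significance in rows:
    # default_threshold = 1 / (vcount * num_treatment_types), as described above.
    default_threshold = 1.0 / (vcount[treatment] * num_treatment_types)
    score_frame.append({
        "variable": variable,
        "treatment": treatment,
        "significance": significance,
        "vcount": vcount[treatment],
        "default_threshold": default_threshold,
        "recommended": significance < default_threshold,
    })

recommended = [r["variable"] for r in score_frame if r["recommended"]]
print(recommended)  # ['x_clean', 'c_logit_code', 'c_lev_b']

# Older scheme for comparison: one shared threshold of 1/number_of_variables.
old_threshold = 1.0 / len(rows)
old_recommended = [v for v, _, sig in rows if sig < old_threshold]
print(old_recommended)  # ['x_clean', 'c_lev_b']

# The per-variable thresholds sum to 1 regardless of how many columns exist.
total_threshold = sum(r["default_threshold"] for r in score_frame)
print(total_threshold)  # 1.0
```

In this made-up example the impact-coded column c_logit_code (significance 0.20) passes its per-type threshold of 0.25 but would have been rejected by the shared threshold of 1/7, because the four indicator columns no longer dilute it. The thresholds summing to 1 is what bounds the expected leakage: assuming the significance of a truly useless column is roughly uniform on [0, 1], the expected number of useless columns passing is at most about one, however many columns are produced.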

It looks like a lot of these details are in the "Deriving the Default Threshold" section of the various tutorial examples (though they use an out-of-date variable name, ntreat): https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.md (and so on).

JohnMount commented 4 years ago

We now have complete documentation of the score frame for vtreat 0.4.0 (which now includes R2): https://github.com/WinVector/pyvtreat/blob/master/Examples/ScoreFrame/ScoreFrame.md.