Closed mtchem closed 4 years ago
Thanks for the note and sorry for the gap in documentation. I'll leave this open until we update the docs.
As for the details:
has_range
means the variable moves (isn't a constant).
vcount
is the number of variables produced for a given treatment type. This lets us set a different significance threshold for each treatment type, so a data set with many indicator columns can still let interesting impact-coded columns through. This improves on our older rule of using "1/number_of_variables" as the significance threshold. This column was added to make the recommended
column reproducibly derivable from other columns in the score frame. default_threshold
is set to 1/(vcount * num_treatment_types)
(num_treatment_types
is the number of different treatment types seen in the problem, typically about 5). If the significance
estimate is below default_threshold
the variable is recommended. Under this scheme, only a constant expected number of truly useless columns get through the treatment.
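To make the rule above concrete, here is a minimal sketch in plain Python (it does not call vtreat). The rows, variable names, and significance values are made up for illustration. The sketch assumes vcount counts every row of a treatment type and that a variable must also have has_range set to be recommended; check the real score frame for the exact behavior.

```python
from collections import Counter

# Hypothetical score-frame rows; names and significance values are invented.
score_frame = [
    {"variable": "x_is_bad", "treatment": "missing_indicator", "has_range": True, "significance": 0.40},
    {"variable": "x", "treatment": "clean_copy", "has_range": True, "significance": 0.001},
    {"variable": "c_logit_code", "treatment": "logit_code", "has_range": True, "significance": 0.02},
    {"variable": "c_lev_a", "treatment": "indicator_code", "has_range": True, "significance": 0.30},
    {"variable": "c_lev_b", "treatment": "indicator_code", "has_range": True, "significance": 0.04},
    {"variable": "k", "treatment": "clean_copy", "has_range": False, "significance": 1.0},
]


def add_recommendations(rows):
    """Return copies of rows with vcount, default_threshold, and recommended added."""
    # Assumption: vcount is the number of rows sharing a treatment type.
    vcounts = Counter(row["treatment"] for row in rows)
    num_treatment_types = len(vcounts)
    result = []
    for row in rows:
        vcount = vcounts[row["treatment"]]
        default_threshold = 1.0 / (vcount * num_treatment_types)
        # Assumption: a constant column (has_range False) is never recommended.
        recommended = row["has_range"] and row["significance"] < default_threshold
        result.append({
            **row,
            "vcount": vcount,
            "default_threshold": default_threshold,
            "recommended": recommended,
        })
    return result


for row in add_recommendations(score_frame):
    print(row["variable"], row["vcount"], row["default_threshold"], row["recommended"])
```

With four treatment types in this toy frame, each type gets a total "budget" of 1/4 split across its vcount variables. If every variable were pure noise with a uniformly distributed significance, the expected number passing would be about 1/4 per type, or roughly one variable overall, regardless of how many columns there are.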
It looks like a lot of these details are in the "Deriving the Default Threshold" section of the various tutorial examples (though they use an out-of-date variable name, ntreat
): https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.md (and so on).
We now have complete documentation of the score frame for vtreat
0.4.0
(which now includes R2
): https://github.com/WinVector/pyvtreat/blob/master/Examples/ScoreFrame/ScoreFrame.md.
First, thank you so much for vtreat; it has definitely changed how I approach pre-processing data. I am trying to understand the different columns created by the method scoreframe for a BinomialOutcomeTreatment. I've looked through the Python examples, the Python API code, and the original paper, but I can't seem to find any information on 'has_range' and 'vcount'. What are the definitions of those columns, and where can I find more documentation on scoreframe?