Closed by cheitzig 4 years ago
Categorical features with too many unique values should be preprocessed first, for example by merging the rare values into a few major categories. If a wildcard bin is added instead, its WoE value is difficult to define, because no training records fall into it.
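A minimal sketch of that kind of merging, assuming pandas and a hypothetical column named `zip3`. The helper names, the `OTHER` label, and the `min_count` threshold are illustrative choices, not part of any library's API. The idea is to pick the major categories from the training set only, then fold everything else (rare training values and values first seen in the test set) into one bucket:

```python
import pandas as pd


def fit_major_categories(train_col: pd.Series, min_count: int) -> set:
    """Return the categories seen at least `min_count` times in training."""
    counts = train_col.value_counts()
    return set(counts[counts >= min_count].index)


def merge_rare(col: pd.Series, major: set, other_label: str = "OTHER") -> pd.Series:
    """Keep major categories; map rare, unseen, or missing values to `other_label`."""
    return col.where(col.isin(major), other_label)


# Toy data (made up for illustration): "331" appears only in the test set.
train = pd.DataFrame({"zip3": ["100", "100", "100", "941", "941", "606"]})
test = pd.DataFrame({"zip3": ["100", "941", "606", "331"]})

# Decide the major categories on the training set only.
major = fit_major_categories(train["zip3"], min_count=2)

# Apply the same mapping to both sets before binning.
train["zip3_merged"] = merge_rare(train["zip3"], major)
test["zip3_merged"] = merge_rare(test["zip3"], major)
```

With this mapping, every value in the test column also exists in the training column, so the merged bucket gets a WoE estimated from the rare training records. That only holds if at least one training value falls below the threshold; otherwise the bucket is empty in training and the original problem remains.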
Got it. Thanks.
We have a dataset of about 3 million records. We're building a model using a 700k training sample and a 300k test sample.
We're building the WoE bins based on the 700k training set, and it turns out that for a few of the categorical variables (e.g., 3-digit ZIP code), there are values in the test set that aren't in the training set.
Two thoughts/questions: