ajschumacher / ajschumacher.github.io

blog
http://planspace.org/

de-categorizing categorical data #287

Open ajschumacher opened 2 years ago

ajschumacher commented 2 years ago

Here's a method used in the XGBoost paper (https://arxiv.org/abs/1603.02754):

Since a tree based model is better at handling continuous features, we preprocess the data by calculating the statistics of average CTR and count of ID features on the first ten days, replacing the ID features by the corresponding count statistics during the next ten days for training.
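A minimal pandas sketch of that idea, assuming a click-log layout (the `day`, `ad_id`, and `click` column names and the tiny dataset are made up for illustration): compute per-ID count and average-CTR statistics on an early window, then join them onto the later window as numeric features.

```python
import pandas as pd

# toy click log; column names are hypothetical
df = pd.DataFrame({
    "day":   [1, 1, 2, 2, 11, 11, 12],
    "ad_id": ["a", "b", "a", "b", "a", "b", "c"],
    "click": [1, 0, 0, 1, 1, 0, 1],
})

early = df[df["day"] <= 10]          # first ten days: fit the statistics
late = df[df["day"] > 10].copy()     # next ten days: use them as features

# per-ID average CTR and count, computed only on the early window
stats = early.groupby("ad_id")["click"].agg(ctr="mean", count="size")

# replace the ID feature with its statistics; IDs unseen early get NaN
late = late.join(stats, on="ad_id")
late[["ctr", "count"]] = late[["ctr", "count"]].fillna(0)
```

Because the statistics come from a disjoint earlier time window, no row's own label leaks into its own feature.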

ajschumacher commented 2 years ago

weight of evidence (WOE) is one of these too; see #288
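A rough sketch of WOE encoding (made-up data; the epsilon smoothing is one common convention, not the only one): each category gets the log ratio of its share of positives to its share of negatives.

```python
import numpy as np
import pandas as pd

# toy data; names are illustrative
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "blue", "green"],
    "y":     [1,     0,     1,      1,      0,      0],
})

pos = df[df["y"] == 1]["color"].value_counts()
neg = df[df["y"] == 0]["color"].value_counts()
cats = df["color"].unique()
pos = pos.reindex(cats, fill_value=0)
neg = neg.reindex(cats, fill_value=0)

# WOE = log(share of positives / share of negatives);
# a small epsilon keeps pure categories out of log(0)
eps = 0.5
woe = np.log(((pos + eps) / pos.sum()) / ((neg + eps) / neg.sum()))

df["color_woe"] = df["color"].map(woe)
```

Categories associated with positives get positive WOE, categories associated with negatives get negative WOE, and a balanced category lands near zero.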

ajschumacher commented 2 years ago

In the CatBoost paper (https://arxiv.org/pdf/1706.09516.pdf):

Further, there is a similar issue in standard algorithms of preprocessing categorical features. One of the most effective ways [6, 25] to use them in gradient boosting is converting categories to their target statistics. A target statistic is a simple statistical model itself, and it can also cause target leakage and a prediction shift.

[6] B. Cestnik et al. Estimating probabilities: a crucial task in machine learning. In ECAI, volume 90, pages 147–149, 1990.

[25] D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001. http://helios.mm.di.uoa.gr/~rouvas/ssi/sigkdd/sigkdd.vol3.1/barreca.pdf
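A hedged sketch in the spirit of the CatBoost paper's fix for that leakage, an "ordered" target statistic: each row is encoded using only rows that come earlier in some ordering (here just the given row order; CatBoost uses random permutations), blended with a global prior. Column names and the prior weight `a` are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "cat": ["a", "a", "b", "a", "b"],
    "y":   [1,   0,   1,   1,   0],
})

prior = df["y"].mean()  # global target mean as the smoothing prior
a = 1.0                 # prior weight (hypothetical choice)

# per-category running sums over *previous* rows only:
# cumsum includes the current row, so subtract it back out
prev_sum = df.groupby("cat")["y"].cumsum() - df["y"]
prev_cnt = df.groupby("cat").cumcount()

# a row's own label never enters its own encoding
df["cat_ts"] = (prev_sum + a * prior) / (prev_cnt + a)
```

The first occurrence of each category falls back to the prior, and later occurrences shrink toward the category's observed mean as evidence accumulates.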

ajschumacher commented 2 years ago

For a categorical feature with high cardinality (#category is large), it often works best to treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or by embedding the categories in a low-dimensional numeric space.

https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
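The first option LightGBM describes, ignoring the categorical interpretation and feeding integer codes to the model as a plain numeric column, can be as simple as this (toy IDs are made up):

```python
import pandas as pd

s = pd.Series(["u123", "u456", "u123", "u789"])

# stable integer code per distinct category
codes, uniques = pd.factorize(s)

# fed to the model as an ordinary numeric feature
df = pd.DataFrame({"user_code": codes})
```

The codes carry no meaningful order, but a deep enough tree ensemble can still carve out useful ranges of them.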

ajschumacher commented 2 years ago

https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931

There seems to be no reason to use One-Hot Encoding over Numeric Encoding.

(i.e., one-hot encoding tends not to work well for tree-based models)