dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[Feature request] Multi-categorical input features #9009

Open svandenhoek opened 1 year ago

svandenhoek commented 1 year ago

It's good to see XGBoost supporting categorical features. In a project, we were looking at using native XGBoost categories, but our dataset also includes a multi-categorical feature. My question is whether this type of feature will also be supported natively within XGBoost.

So for example:

feat1
a
a,c
b,d

Which should be interpreted as:

feat1_a  feat1_b  feat1_c  feat1_d
1        0        0        0
1        0        1        0
0        1        0        1

trivialfis commented 1 year ago

One simple solution would be to create a new category for each combination of existing categories, but that might not be feasible when the cardinality is not trivial. Any suggestions?
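
To make the combination approach concrete, here is a minimal sketch in plain Python based on the example rows from the first comment. The helper name `combination_codes` is hypothetical and not an XGBoost API, and the last row ("c,a") is added only to show that member order should not matter. Each cell is canonicalized and every distinct combination gets its own integer code.

```python
def combination_codes(cells, sep=","):
    """Map each multi-category cell to one integer code per distinct combination.

    Hypothetical helper for illustration; not part of XGBoost.
    """
    codes = {}    # canonical combination -> integer code
    encoded = []  # one code per input row
    for cell in cells:
        # Sort the members so that "c,a" and "a,c" are the same combination.
        key = tuple(sorted(part.strip() for part in cell.split(sep)))
        if key not in codes:
            codes[key] = len(codes)
        encoded.append(codes[key])
    return encoded, codes


rows = ["a", "a,c", "b,d", "c,a"]
encoded, codes = combination_codes(rows)
print(encoded)  # [0, 1, 2, 1]
print(codes)    # {('a',): 0, ('a', 'c'): 1, ('b', 'd'): 2}

# Upper bound on distinct non-empty combinations of k base categories.
k = 4
print(2**k - 1)  # 15
```

With k base categories there can be up to 2^k - 1 non-empty combinations, which is why this only works when the cardinality stays small. Note also that the codes for "a" and "a,c" are unrelated integers, so the shared category is no longer visible after encoding.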

svandenhoek commented 1 year ago

Besides feasibility (which depends on the data used), having a separate category for each combination could also obscure potentially useful information. While a combination by itself does carry a certain amount of information, this approach could result in many different combination-categories that in reality all share a single category (each time combined with different other categories).

A possible solution would be to allow the user to define which features are multi-categorical through an extra parameter (e.g. multi_cat_sep={'<column_name>':'<separator>'}). Then, instead of one-hot encoding, process these differently (maybe something like MultiLabelBinarizer?).
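
As a rough sketch of what such preprocessing could produce, the plain-Python example below expands a separator-delimited column into indicator columns, similar in spirit to scikit-learn's MultiLabelBinarizer. `multi_cat_sep` is only the parameter proposed here and does not exist in XGBoost; `multi_hot` is a hypothetical helper name.

```python
def multi_hot(cells, sep=","):
    """Expand separator-delimited cells into sorted indicator columns.

    Hypothetical helper for illustration; not part of XGBoost.
    """
    # Split every cell into its set of member categories.
    split_cells = [
        {part.strip() for part in cell.split(sep) if part.strip()}
        for cell in cells
    ]
    # The column order is the sorted union of all categories seen.
    categories = sorted(set().union(*split_cells))
    matrix = [
        [1 if cat in members else 0 for cat in categories]
        for members in split_cells
    ]
    return categories, matrix


# Proposed (not existing) parameter: column name -> separator.
multi_cat_sep = {"feat1": ","}
data = {"feat1": ["a", "a,c", "b,d"]}

for column, sep in multi_cat_sep.items():
    categories, matrix = multi_hot(data[column], sep)
    print([f"{column}_{cat}" for cat in categories])
    for row in matrix:
        print(row)
# ['feat1_a', 'feat1_b', 'feat1_c', 'feat1_d']
# [1, 0, 0, 0]
# [1, 0, 1, 0]
# [0, 1, 0, 1]
```

This reproduces the table from the first comment. The resulting columns are plain 0/1 features, so the trees would treat each category independently rather than as one categorical feature.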

Alternatively, something like a custom type could be considered to handle this (pandas for example allows defining custom types) so that XGBoost knows when it deals with a multicategorical feature. Probably more expensive to implement though.

trivialfis commented 1 year ago

XGBoost grows trees differently when using categorical features, rather than simply using a preprocessor. My question is more about how to obtain the optimal split value for multi-cat.
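
For context on why this is more than a preprocessing question: a common way to find a partition split for an ordinary single-valued categorical feature is to sum the gradient statistics per category, sort the categories by the ratio of gradient sum to hessian sum, and scan that order for the best cut. The plain-Python sketch below illustrates the idea only; it is not XGBoost's actual implementation, and `best_partition`, `lam`, and the toy gradients are assumptions made for the example.

```python
from collections import defaultdict


def best_partition(categories, grads, hess, lam=1.0):
    """Return (gain, left_categories) for the best sorted-order partition.

    Simplified illustration; assumes each row has exactly one category.
    """
    g_sum = defaultdict(float)
    h_sum = defaultdict(float)
    for cat, g, h in zip(categories, grads, hess):
        g_sum[cat] += g
        h_sum[cat] += h

    g_total = sum(g_sum.values())
    h_total = sum(h_sum.values())

    def score(g, h):
        # Structure score of a leaf with regularization term lam.
        return g * g / (h + lam)

    # Sort categories by gradient/hessian ratio, then scan contiguous prefixes.
    order = sorted(g_sum, key=lambda c: g_sum[c] / (h_sum[c] + lam))
    best_gain, best_left = 0.0, []
    g_left = h_left = 0.0
    for i, cat in enumerate(order[:-1]):
        g_left += g_sum[cat]
        h_left += h_sum[cat]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_left = gain, order[: i + 1]
    return best_gain, best_left


cats = ["a", "b", "c", "a", "b", "d"]
grads = [-1.0, 0.8, -0.9, -1.1, 1.2, 0.7]
hess = [1.0] * 6
gain, left = best_partition(cats, grads, hess)
print(left)  # ['a', 'c']
```

The scan relies on every row belonging to exactly one category, so the per-category sums partition the rows. With a multi-categorical feature, a row such as "a,c" would contribute to both the "a" and "c" sums, and moving one category to the left side no longer moves a disjoint set of rows. That is the open question for multi-cat splits.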