Open svandenhoek opened 1 year ago
One simple solution would be to create a new category for each combination of existing categories, but that might not be feasible when the cardinality is non-trivial. Any suggestions?
Besides feasibility (which depends on the data used), having a separate category for each combination could also obscure potentially useful information. While the combination in itself does carry a certain amount of information, it could also result in many different combination-categories that in reality all have a single category in common (just combined with many different other categories).
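To make that concern concrete, here is a minimal sketch in plain Python. The column values and the "|" separator are made up for illustration and are not taken from any real dataset.

```python
# Hypothetical multi-categorical column: each row holds several categories.
rows = ["red|blue", "red|green", "red|yellow"]

# Treating each combination as its own category gives one category per
# distinct string, so the model sees three unrelated values.
combo_categories = sorted(set(rows))

# Splitting on the separator shows that every row actually shares "red".
rows_with_red = sum("red" in row.split("|") for row in rows)

print(combo_categories)
print(rows_with_red)
```

With combination-categories, the fact that all three rows contain "red" is no longer visible to the model.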
A possible solution would be to allow the user to define which features are multi-categorical through an extra parameter (e.g. multi_cat_sep={'<column_name>':'<separator>'}). Then, instead of one-hot encoding them, process these features differently (maybe with something like MultiLabelBinarizer?).
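A rough sketch of what that preprocessing could look like, in plain Python. Note that multi_cat_sep is only the parameter name proposed above, not an existing XGBoost option, and multi_hot is a hypothetical helper that builds the same kind of 0/1 indicator columns that scikit-learn's MultiLabelBinarizer produces.

```python
def multi_hot(values, sep):
    """Split each value on `sep` and build one 0/1 indicator column per category."""
    split = [[part.strip() for part in v.split(sep) if part.strip()] for v in values]
    classes = sorted({part for parts in split for part in parts})
    matrix = [[1 if c in parts else 0 for c in classes] for parts in split]
    return classes, matrix


# Hypothetical parameter from the proposal: column name -> separator.
multi_cat_sep = {"genres": "|"}

# Made-up example data.
data = {"genres": ["action|comedy", "comedy", "drama|action"]}

encoded = {col: multi_hot(data[col], sep) for col, sep in multi_cat_sep.items()}
classes, matrix = encoded["genres"]

print(classes)
for row in matrix:
    print(row)
```

This only covers the preprocessing side. It does not say anything about how a split on such a feature would be chosen inside the tree builder.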
Alternatively, something like a custom type could be considered (pandas, for example, allows defining custom types) so that XGBoost knows when it is dealing with a multi-categorical feature. That is probably more expensive to implement, though.
XGBoost grows trees differently when using categorical features; it does not simply apply a preprocessor. My question is more about how to obtain the optimal split value for a multi-categorical feature.
It's good to see XGBoost supporting categorical features. In a project, we were looking at using native XGBoost categories, but our dataset also includes a multi-categorical feature. My question is whether this type of feature will also be supported natively within XGBoost.