dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Plots for categorical splits don't show named categories #9927

Open david-cortes opened 8 months ago

david-cortes commented 8 months ago

When plotting a tree that has categorical splits, the plot labels the splits with the numeric codes of the categories rather than their names:

import numpy as np, xgboost as xgb
rng = np.random.default_rng(seed=123)
X = rng.integers(4, size=(100,3))
y = rng.standard_normal(size=100)
dm = xgb.DMatrix(
    data=X,
    label=y,
    feature_types=["c"]*3
)
model = xgb.train(
    dtrain=dm,
    params={
        "tree_method" : "hist",
        "max_depth" : 2
    },
    num_boost_round=3
)
xgb.plot_tree(model)

[image: the resulting tree plot, with splits labeled by category numbers]

Categorical features typically have named categories. It would be quite helpful to show those names on the plots instead of the numbers, which might not be easy to mentally map to a given category.

For this, I guess a potential solution could be to add an additional DMatrix/Booster string attribute, "categorical_names" or so, like the existing "feature_name".

trivialfis commented 8 months ago

XGBoost takes encoded categories instead of raw data; as a result, it has no names for them. We need to think of a way to pass the information from the encoder to XGBoost.
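
To illustrate the gap, here is a minimal sketch in plain Python, not anything XGBoost provides. The encoder keeps the code-to-name mapping on the user's side, and a hypothetical helper, `name_categories`, uses it to rewrite a split label after the fact. The `f0:{0,2}` label shape, the helper name, and the sample data are all assumptions made for illustration.

```python
import re

# Hypothetical raw categorical column; XGBoost only ever sees the integer codes.
raw = ["red", "green", "blue", "green", "red"]

# Ordinal-encode: sorted unique categories -> consecutive integer codes.
categories = sorted(set(raw))
code_of = {name: code for code, name in enumerate(categories)}
codes = [code_of[value] for value in raw]  # the encoded column passed to the model


def name_categories(label, names_by_feature):
    """Replace codes in a label such as 'f0:{0,2}' with category names.

    The 'feature:{codes}' label shape is an assumption for this sketch.
    """
    match = re.fullmatch(r"(\w+):\{([\d,]+)\}", label)
    if match is None or match.group(1) not in names_by_feature:
        return label  # not a categorical split we have names for: leave as-is
    feature = match.group(1)
    names = names_by_feature[feature]
    picked = [names[int(code)] for code in match.group(2).split(",")]
    return feature + ":{" + ",".join(picked) + "}"


print(name_categories("f0:{0,2}", {"f0": categories}))
```

The mapping (`categories` here) is exactly the information that would need to travel from the encoder into the DMatrix or Booster for the plots to show names on their own.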