Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
When producing a plot of a tree with categorical splits, the plot labels the splits with the integer codes of the categories:
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(seed=123)
X = rng.integers(4, size=(100, 3))
y = rng.standard_normal(size=100)
dm = xgb.DMatrix(
    data=X,
    label=y,
    feature_types=["c"] * 3
)
model = xgb.train(
    dtrain=dm,
    params={
        "tree_method": "hist",
        "max_depth": 2
    },
    num_boost_round=3
)
xgb.plot_tree(model)
Categorical features typically have named categories. It would be quite helpful to show those names on the plots instead of the numbers, which are not easy to map mentally to a given category.
For this, I guess a potential solution would be to add a DMatrix/Booster string attribute such as "categorical_names", similar to the existing "feature_names".
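To make the idea concrete, here is a rough sketch of what such an attribute could hold and how a plotting routine might use it. Everything here is an assumption for illustration: "categorical_names" is only the name proposed above and not an existing XGBoost attribute, `relabel_split` is a hypothetical helper, and the `feature:{codes}` label shape is an assumed format rather than something taken from the plotting code.

```python
import json
import re

# Hypothetical mapping: feature name -> category names, indexed by integer code.
# "categorical_names" is the name proposed in this issue, not an existing XGBoost API.
categorical_names = {
    "f0": ["red", "green", "blue", "yellow"],
    "f1": ["small", "medium", "large", "huge"],
}

# A string attribute can only hold text, so the mapping is serialized (here as JSON).
serialized = json.dumps(categorical_names)

# Assumed shape of a categorical split label: "<feature>:{code,code,...}".
SPLIT_PATTERN = re.compile(r"(\w+):\{([\d,\s]+)\}")


def relabel_split(label, serialized_names):
    """Replace integer category codes in a split label with category names."""
    names = json.loads(serialized_names)

    def substitute(match):
        feature, codes = match.group(1), match.group(2)
        if feature not in names:
            # No names known for this feature: leave the label untouched.
            return match.group(0)
        labels = [names[feature][int(code)] for code in codes.split(",")]
        return feature + ":{" + ", ".join(labels) + "}"

    return SPLIT_PATTERN.sub(substitute, label)


print(relabel_split("f0:{1,3}", serialized))  # f0:{green, yellow}
```

Labels for numeric splits, or for features without stored names, pass through unchanged, so plots would stay as they are today when no names are supplied.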
XGBoost takes encoded categories instead of raw data; as a result, it has no names for them. We need to think of a way to pass that information from the encoder to XGBoost.
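For illustration, the snippet below shows the kind of information an encoder holds and XGBoost never sees. It is a minimal pure-Python sketch, not the encoder any particular library uses, and `encode_categories` is a hypothetical helper.

```python
def encode_categories(values):
    """Map raw labels to integer codes; return the codes and the names by code."""
    names = sorted(set(values))  # position in this list is the integer code
    code_of = {name: code for code, name in enumerate(names)}
    codes = [code_of[value] for value in values]
    return codes, names


raw = ["cat", "dog", "bird", "dog", "cat"]
codes, names = encode_categories(raw)

print(codes)  # [1, 2, 0, 2, 1]  <- what the model is trained on
print(names)  # ['bird', 'cat', 'dog']  <- what stays behind in the encoder
```

Only `codes` reaches the DMatrix, so whatever mechanism is chosen has to carry `names` along separately.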