dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[R] Cannot plot trees with categorical splits #9925

Open david-cortes opened 8 months ago

david-cortes commented 8 months ago

ref https://github.com/dmlc/xgboost/issues/9810

Currently, attempting to plot trees that have categorical splits in R will result in an error:

library(xgboost)
set.seed(123)
y <- rnorm(100)
x <- sample(3, size=100*3, replace=TRUE) |> matrix(nrow=100)
x <- x - 1
dm <- xgb.DMatrix(data=x, label=y)
setinfo(dm, "feature_type", c("c", "c", "c"))
model <- xgb.train(
    data=dm,
    params=list(
        tree_method="hist",
        max_depth=3
    ),
    nrounds=2
)
xgb.plot.tree(model=model)

Error in do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE] :
  subscript out of bounds

This happens because the regexes used to parse the text dumps were never updated for the format used by categorical splits: https://github.com/dmlc/xgboost/blob/a197899161fa70e681101de4232745fdfe737804/R-package/R/xgb.model.dt.tree.R#L123
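To illustrate the failure mode, here is a minimal Python sketch of a regex that only knows about numerical splits. Both the pattern and the two dump lines are assumptions made for illustration: the pattern is a simplified stand-in, not the one in xgb.model.dt.tree.R, and the categorical line only approximates what the text dump looks like. The point is that a line that does not match yields no captured fields, which is the kind of gap that later surfaces as an out-of-bounds subscript when the matches are bound into a table.

```python
import re

# Hypothetical, simplified pattern for a numerical split line such as
# "0:[f0<0.5] yes=1,no=2,missing=2,gain=1.25,cover=100".
# This is NOT the regex used in the R package.
NUMERIC_SPLIT = re.compile(
    r"^(\d+):\[(\w+)<([^\]]+)\] "
    r"yes=(\d+),no=(\d+),missing=(\d+),gain=([^,]+),cover=(.+)$"
)

# Assumed dump lines; the categorical form (a set of categories instead of
# a "<threshold" condition) is an approximation of the real output.
numeric_line = "0:[f0<0.5] yes=1,no=2,missing=2,gain=1.25,cover=100"
categorical_line = "0:[f0:{0,2}] yes=1,no=2,missing=2,gain=1.25,cover=100"


def parse_split(line):
    """Return the captured fields as a tuple, or None if the line does not match."""
    match = NUMERIC_SPLIT.match(line)
    return match.groups() if match else None


numeric_fields = parse_split(numeric_line)
categorical_fields = parse_split(categorical_line)

print(numeric_fields)
print(categorical_fields)
```

Under these assumptions the numerical line parses into eight fields, while the categorical line produces no match at all.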

david-cortes commented 8 months ago

@mayer79 Perhaps you would like to work on this issue?

trivialfis commented 8 months ago

I hope we can remove the regex if possible. XGB can output graphviz dump. I can help add other formats if necessary.

david-cortes commented 8 months ago

> I hope we can remove the regex if possible. XGB can output graphviz dump. I can help add other formats if necessary.

@trivialfis It would be very helpful to add a "table" format that outputs the same thing as the Python function trees_to_dataframe. Then we could get rid of the regexes in both interfaces and avoid having to update them when vector leaves are implemented.

trivialfis commented 8 months ago

Yes, I did a proof of concept before, but didn't submit a PR because at the time I was wondering how to export the data to arrow. I can make another attempt.

david-cortes commented 8 months ago

> Yes, I did a proof of concept before, but didn't submit a PR because at the time I was wondering how to export the data to arrow. I can make another attempt.

I don't think arrow is necessary here - these tables are going to be rather small in most cases, so perhaps a plain JSON with one entry per column in the table would do.
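A rough sketch of what "one entry per column" could look like, using only the Python standard library. The column names loosely follow what trees_to_dataframe returns, but the exact set of columns, the sample node values, and the helper name `to_column_json` are all assumptions for illustration, not an existing XGBoost format.

```python
import json

# Hypothetical node records for a single tree; values are made up.
# The root uses a categorical split, the other two nodes are leaves.
nodes = [
    {"Tree": 0, "Node": 0, "Feature": "f0", "Split": None, "Category": [0, 2],
     "Yes": 1, "No": 2, "Missing": 2, "Gain": 1.25, "Cover": 100.0},
    {"Tree": 0, "Node": 1, "Feature": "Leaf", "Split": None, "Category": None,
     "Yes": None, "No": None, "Missing": None, "Gain": 0.1, "Cover": 60.0},
    {"Tree": 0, "Node": 2, "Feature": "Leaf", "Split": None, "Category": None,
     "Yes": None, "No": None, "Missing": None, "Gain": -0.2, "Cover": 40.0},
]


def to_column_json(rows):
    """Serialize row records as a JSON object with one array per column."""
    if not rows:
        return "{}"
    # Column order follows the keys of the first record.
    columns = {key: [row.get(key) for row in rows] for key in rows[0]}
    return json.dumps(columns)


payload = to_column_json(nodes)
decoded = json.loads(payload)
print(decoded["Node"], decoded["Category"])
```

A layout like this maps directly onto a data.frame in R or a DataFrame in Python, and a list-valued column such as the hypothetical `Category` above can carry the category set of a split without any string parsing.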

trivialfis commented 8 months ago

> I don't think arrow is necessary here

It's more future-proof. We have already had feature requests for representing the model as a table, which means XGBoost needs to be able to both save and load models as tables. Currently, the to_table method doesn't have a corresponding from_table implementation.

Another thing about Arrow is that the performance is just a bonus. I believe the goal is to have a protocol-like class that can be used by other projects. For example, the Spark framework uses Arrow as the underlying representation of a table and uses it to transfer dataframes from Java processes to Python processes, and presumably to R as well. As a result, if we are dealing with dataframes, exporting directly to Arrow might be the most efficient and useful way to do it.