Open david-cortes opened 10 months ago
@mayer79 Perhaps you would like to work on this issue?
I hope we can remove the regex if possible. XGB can output graphviz dump. I can help add other formats if necessary.
@trivialfis It would be very helpful to add a "table" format that outputs the same as the Python function trees_to_dataframe. Then we could get rid of the regexes in both interfaces and avoid needing to update them when vector leaves are implemented.
Yes, I did a proof of concept before, but didn't submit a PR because at the time I was wondering how to export the data to arrow. I can make another attempt.
I don't think arrow is necessary here - these tables are going to be rather small in most cases, so perhaps a plain JSON with one entry per column in the table would do.
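To make the "one entry per column" idea concrete, here is a minimal sketch of such a JSON layout. The column names are loosely modeled on what trees_to_dataframe returns, and the rows, values, and helper names (rows_to_columns, columns_to_rows) are hypothetical and made up for illustration; this is not an existing XGBoost format.

```python
import json

# Hypothetical rows for a single depth-1 tree; all values are made up for illustration.
rows = [
    {"Tree": 0, "Node": 0, "Feature": "f0", "Split": 0.5, "Yes": 1, "No": 2},
    {"Tree": 0, "Node": 1, "Feature": "Leaf", "Split": None, "Yes": None, "No": None},
    {"Tree": 0, "Node": 2, "Feature": "Leaf", "Split": None, "Yes": None, "No": None},
]


def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict holding one list per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name in columns:
            columns[name].append(row[name])
    return columns


def columns_to_rows(columns):
    """Inverse of rows_to_columns: rebuild the list of row dicts."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    return [{name: columns[name][i] for name in names} for i in range(n_rows)]


# Serialize column-wise, then read it back the way a client interface might.
payload = json.dumps(rows_to_columns(rows))
restored = columns_to_rows(json.loads(payload))
```

A column-wise layout like this maps directly onto a data.frame in R or a DataFrame in Python, so neither interface would need to parse text dumps.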
> I don't think arrow is necessary here
It's more future-proof. We have already had feature requests for representing the model as a table, which means XGBoost needs to be able to both save and load models as tables. Currently, the to_table method doesn't have a corresponding from_table implementation.
Another thing about Arrow is that the performance is just a bonus. I believe the goal is to have a protocol-like class that other projects can use. For example, Spark uses Arrow as the underlying representation of a table and uses it to transfer dataframes from Java processes to Python processes, and presumably to R as well. As a result, if we are dealing with dataframes, exporting directly to Arrow might be the most efficient and useful way to do it.
ref https://github.com/dmlc/xgboost/issues/9810
Currently, attempting to plot trees that have categorical splits in R will result in an error:
This is due to the regexes used to parse the dumps not having been updated for the format used in categorical splits: https://github.com/dmlc/xgboost/blob/a197899161fa70e681101de4232745fdfe737804/R-package/R/xgb.model.dt.tree.R#L123
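A rough sketch of the failure mode, written in Python for brevity. The pattern below is illustrative only and is not the regex used in the R package, and both sample dump lines (in particular the categorical one) are assumed formats, not output copied from XGBoost.

```python
import re

# Illustrative pattern for a numeric split line; NOT the regex from the R package.
NUMERIC_SPLIT = re.compile(r"^(\d+):\[(\w+)<([^\]]+)\] yes=(\d+),no=(\d+)")

# Assumed dump lines, for illustration only.
numeric_line = "0:[f0<0.5] yes=1,no=2,missing=1"
categorical_line = "0:[f1:{0,2}] yes=1,no=2,missing=1"


def parse_numeric(line):
    """Parse a numeric split line; return None when the line does not match."""
    match = NUMERIC_SPLIT.match(line)
    if match is None:
        return None
    node, feature, threshold, yes, no = match.groups()
    return {
        "node": int(node),
        "feature": feature,
        "threshold": float(threshold),
        "yes": int(yes),
        "no": int(no),
    }


numeric_result = parse_numeric(numeric_line)          # parsed into a dict
categorical_result = parse_numeric(categorical_line)  # no "<" in the line, so no match
```

Any parser written only for the `feature<threshold` form silently misses a line that lists categories instead, which is why a structured export would be more robust than extending the regexes.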