I think I have found a bug in XGBoost. Suppose I train an XGBoost model, take its booster via booster = model.get_booster(), and that booster has an 'i' in its feature_types (i.e. a boolean variable). Then booster.trees_to_dataframe() crashes with this error: "ValueError: Failed to parse model text dump.".
How I came across the bug: I recently updated a bunch of Python libraries (in particular Pandas, from 1.5.1 to 2.2.2), and my script, which used to work, now crashes at the booster.trees_to_dataframe() step. I looked at the source code at https://github.com/dmlc/xgboost, in the file python-package/xgboost/core.py, for the method trees_to_dataframe. The issue is in this part of the code:
if fid[0].find("<") != -1: ...
elif fid[0].find(":{") != -1: ...
else: raise ValueError("Failed to parse model text dump.")
The problem is that for a feature_type 'i' there's no "<" or ":{" to find in the string so it ends in the "else" part that raises a ValueError.
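To make the failing branch concrete, here is a minimal sketch that mirrors the three-way check quoted above on individual node lines. The helper name classify_split and the three sample node lines are hypothetical and only illustrate the general shape of a text dump; they are not copied from a real model.

```python
def classify_split(node_line: str) -> str:
    """Mimic the branch selection quoted above for one node line.

    Hypothetical helper for illustration, not part of XGBoost.
    """
    # Take the text between "[" and "]", i.e. the split condition.
    condition = node_line.split("[")[1].split("]")[0]
    if condition.find("<") != -1:
        return "numerical"
    elif condition.find(":{") != -1:
        return "categorical"
    else:
        # An indicator ('i') split has neither "<" nor ":{".
        raise ValueError("Failed to parse model text dump.")


# Illustrative node lines (assumed shapes, not real dump output).
sample_lines = [
    "0:[f0<0.5] yes=1,no=2,missing=1",  # numerical split
    "0:[f0:{0,1}] yes=1,no=2",          # categorical split
    "0:[f0] yes=1,no=2",                # indicator split, no threshold
]

for line in sample_lines:
    try:
        print(line, "->", classify_split(line))
    except ValueError as err:
        print(line, "->", "ValueError:", err)
```

Only the third line, which carries a bare feature name inside the brackets, reaches the else branch.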
I found two ways to avoid the error on my side:
1. Explicitly cast the boolean columns of the Pandas DataFrame to np.uint8 instead of bool. (np.uint8 used to be the default dtype returned by pd.get_dummies; it has since changed to bool, which is why I now hit the bug.)
2. Before calling booster.trees_to_dataframe(), change the booster's feature types from boolean to integer via booster.feature_types = ['int' if feature_type == 'i' else feature_type for feature_type in booster.feature_types].
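The two workarounds can be wrapped in small helpers, sketched below. Both function names are hypothetical. The first assumes its argument is a Pandas DataFrame, and the second only operates on a plain list such as the one returned by booster.feature_types.

```python
def bool_columns_to_uint8(df):
    """Workaround 1: return a copy of a Pandas DataFrame whose bool
    columns are cast to uint8, leaving all other columns untouched.

    Hypothetical helper; `df` is assumed to be a pandas.DataFrame.
    """
    bool_cols = df.select_dtypes(include="bool").columns
    return df.astype({col: "uint8" for col in bool_cols})


def indicator_types_to_int(feature_types):
    """Workaround 2: replace every 'i' (indicator) entry with 'int'.

    Hypothetical helper; takes and returns a plain list of strings.
    """
    return ["int" if t == "i" else t for t in feature_types]


# Intended usage, assuming df_X and booster exist as in the script:
#   df_X = bool_columns_to_uint8(df_X)   # before model.fit(...)
#   booster.feature_types = indicator_types_to_int(booster.feature_types)
```

If the booleans come from pd.get_dummies, passing its dtype argument (for example dtype="uint8") should avoid creating bool columns in the first place.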
In the long term I think that booster.trees_to_dataframe() should not crash with a booster where there is a feature type 'i'.
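As a rough idea of what handling 'i' could look like, here is a sketch of a condition parser with a third branch for indicator splits. This is an assumption about one possible approach, not the actual code in core.py and not a proposed patch. The name parse_condition and the choice to return None for both the threshold and the categories of an indicator split are mine.

```python
from typing import List, Optional, Tuple


def parse_condition(
    condition: str,
) -> Tuple[str, Optional[float], Optional[List[str]]]:
    """Parse the text between "[" and "]" of a dumped node.

    Hypothetical sketch: returns (feature, threshold, categories).
    """
    if "<" in condition:
        # Numerical split, e.g. "f0<0.5".
        feature, threshold = condition.rsplit("<", 1)
        return feature, float(threshold), None
    if ":{" in condition:
        # Categorical split, e.g. "f0:{0,1}".
        feature, cats = condition.split(":{", 1)
        return feature, None, cats.rstrip("}").split(",")
    # Indicator split, e.g. "f0": no threshold and no category list,
    # so report the feature name alone instead of raising.
    return condition, None, None
```

With a branch like this, a booster whose feature_types contains 'i' would yield rows with an empty split value rather than a ValueError.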
Here is a sample script to see that trees_to_dataframe crashes when there's a feature type 'i' and that the two suggested techniques do avoid the crash:
# Python 3.12.4, macOS 14.5
# Import a toy dataset
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
# Convert it to Pandas
import pandas as pd # 2.2.2
df_X = pd.DataFrame(data=dataset.data,columns=dataset.feature_names)
s_y = pd.Series(data=dataset.target,name='target')
# Convert the features to boolean
df_X = df_X>df_X.quantile(axis=0,q=0.5).values.reshape(1,-1)
# One way to keep trees_to_dataframe from crashing is to convert the Pandas booleans to integers
do_temporary_solution1 = False # try setting it to True
if do_temporary_solution1:
    df_X = df_X.astype(int)
# Take a XGBoost model
from xgboost import XGBClassifier # 2.0.3
model = XGBClassifier(random_state=6*7)
# Train the model
model.fit(X=df_X,y=s_y)
# Take the booster
booster = model.get_booster()
# Look at the feature_types
print(booster.feature_types) # 30*['i']
# Another way to keep trees_to_dataframe from crashing is to convert the types 'i' to 'int'
do_temporary_solution2 = False # try setting it to True
if do_temporary_solution2:
    booster.feature_types = ['int' if feature_type == 'i' else feature_type for feature_type in booster.feature_types]
# Convert to a DataFrame
df_booster = booster.trees_to_dataframe() # ValueError: Failed to parse model text dump.
print(df_booster)