dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

booster.trees_to_dataframe crashes when there are boolean feature_types 'i' #10437

Open Noe-AC opened 3 months ago

Noe-AC commented 3 months ago

I think I have found a bug in XGBoost. Suppose I train an XGBoost model, take its booster via booster = model.get_booster(), and that booster has an 'i' (i.e. a boolean variable) in its feature_types. Then the method booster.trees_to_dataframe() crashes with this error: "ValueError: Failed to parse model text dump.".

How I came across the bug: I recently updated a bunch of Python libraries (notably Pandas from 1.5.1 to 2.2.2), and my script, which used to work, now crashes at the step booster.trees_to_dataframe(). I looked at the source code at https://github.com/dmlc/xgboost, in the file python-package/xgboost/core.py, for the method trees_to_dataframe. The issue is in this part of the code:

if fid[0].find("<") != -1: ...
elif fid[0].find(":{") != -1: ...
else: raise ValueError("Failed to parse model text dump.")

The problem is that for a feature_type 'i' there's no "<" or ":{" to find in the string so it ends in the "else" part that raises a ValueError.
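
To make the failure concrete, here is a minimal sketch of that branching logic applied to the bracketed split condition of a node. The sample conditions are illustrative assumptions about what the text dump contains for each feature type (the indicator form, a bare feature name, is inferred from the description above), and split_kind is a hypothetical helper, not part of XGBoost.

```python
def split_kind(condition: str) -> str:
    """Mimic the two branches quoted above on the text found between '[' and ']'."""
    if condition.find("<") != -1:
        return "numerical"
    elif condition.find(":{") != -1:
        return "categorical"
    else:
        raise ValueError("Failed to parse model text dump.")


# Assumed examples of split conditions, one per feature type.
print(split_kind("f0<0.5"))    # numerical split
print(split_kind("f1:{0,2}"))  # categorical split
try:
    split_kind("f2")           # assumed indicator split: no "<" and no ":{"
except ValueError as err:
    print(err)
```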

I found two ways to avoid the error on my side:

  1. Explicitly cast the boolean columns of the Pandas DataFrame to np.uint8 instead of bool (np.uint8 used to be the default dtype returned by pd.get_dummies; the default is now bool, which is why I now hit the bug).
  2. Before using booster.trees_to_dataframe(), cast the booster types as integers instead of booleans via booster.feature_types = ['int' if feature_type == 'i' else feature_type for feature_type in booster.feature_types].
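
For the first workaround, a minimal sketch, assuming the boolean columns come either from pd.get_dummies or from a comparison: request the integer dtype up front, or cast the boolean columns afterwards. The toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data, only for illustration.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.0, 2.5, 3.0]})

# Option A: ask get_dummies for uint8 instead of the pandas 2.x default (bool).
dummies = pd.get_dummies(df, columns=["color"], dtype=np.uint8)

# Option B: cast boolean columns that already exist in a DataFrame.
flags = pd.DataFrame({"is_large": df["size"] > 2.0, "size": df["size"]})
bool_cols = flags.select_dtypes(include="bool").columns
flags = flags.astype({col: np.uint8 for col in bool_cols})

print(dummies.dtypes)
print(flags.dtypes)
```

Only the boolean columns are touched in option B, so float or categorical features keep their dtypes.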

In the long term I think that booster.trees_to_dataframe() should not crash with a booster where there is a feature type 'i'.

Here is a sample script to see that trees_to_dataframe crashes when there's a feature type 'i' and that the two suggested techniques do avoid the crash:

# Python 3.12.4, macOS 14.5
import pandas as pd  # 2.2.2
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier  # 2.0.3

# Load a toy dataset and convert it to Pandas
dataset = load_breast_cancer()
df_X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
s_y = pd.Series(data=dataset.target, name='target')

# Convert the features to boolean (True when the value is above the column median)
df_X = df_X > df_X.quantile(axis=0, q=0.5).values.reshape(1, -1)

# Workaround 1: convert the Pandas booleans to integers before training
do_temporary_solution1 = False  # try setting it to True
if do_temporary_solution1:
    df_X = df_X.astype(int)  # keeps trees_to_dataframe from crashing

# Train an XGBoost model
model = XGBClassifier(random_state=6*7)
model.fit(X=df_X, y=s_y)

# Take the booster and look at its feature_types
booster = model.get_booster()
print(booster.feature_types)  # 30*['i']

# Workaround 2: convert the feature types 'i' to 'int' on the booster
do_temporary_solution2 = False  # try setting it to True
if do_temporary_solution2:
    booster.feature_types = ['int' if feature_type == 'i' else feature_type for feature_type in booster.feature_types]

# Convert to a DataFrame
df_booster = booster.trees_to_dataframe()  # ValueError: Failed to parse model text dump.
print(df_booster)

leonya57 commented 1 week ago

Looks like the C++ writer has three node formats (quantitative, categorical and indicator), but the Python parser does not implement the last one.

https://github.com/dmlc/xgboost/blob/f52f11e1d7c3e2c5b065f8fca6defc818089cebc/src/tree/tree_model.cc#L293C58-L293C66
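
A hedged sketch of what a third branch in the Python parser could look like. This is not the library's code: parse_split is a hypothetical helper, the assumption that an indicator split is dumped as a bare feature name comes from the discussion above, and returning NaN as its split value is a choice made here purely for illustration.

```python
import math
from typing import List, Optional, Tuple


def parse_split(condition: str) -> Tuple[str, float, Optional[List[str]]]:
    """Return (feature, split value, categories) for the text between '[' and ']'."""
    if "<" in condition:
        # Quantitative split, e.g. "f0<0.5".
        feature, threshold = condition.split("<", 1)
        return feature, float(threshold), None
    if ":{" in condition:
        # Categorical split, e.g. "f1:{0,2}"; strip the braces around the set.
        feature, cats = condition.split(":", 1)
        return feature, math.nan, cats[1:-1].split(",")
    # Assumed indicator split: only the feature name, no threshold or category set.
    return condition, math.nan, None


for sample in ("f0<0.5", "f1:{0,2}", "f2"):
    print(parse_split(sample))
```

With a fallback like this, a boolean feature would produce a row with no threshold instead of raising, although whether NaN is the right value to report for such a split is a separate decision for the maintainers.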