dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Integer overflow in `get_dump` and `trees_to_dataframe` #10035

Open Holorite opened 7 months ago

Holorite commented 7 months ago

Dumping tree information using get_dump and trees_to_dataframe in Python causes integer overflow errors for models with large split conditions. This has only been tested with Python regression models on v2.0.3.

How to reproduce

In Python:

import pandas as pd
from xgboost import XGBRegressor

model = XGBRegressor(n_jobs=100, n_estimators=1, min_child_weight=1, subsample=1, learning_rate=0.05, max_depth=4, random_state=1)
input = pd.DataFrame({'very_large_number': [2*10**10, 3*10**10]})  
output = pd.DataFrame({'true_results': [0, 1]})

model.fit(input, output)

model.get_booster().trees_to_dataframe()
>>>    Tree  Node   ID            Feature         Split  Yes   No Missing    Gain  Cover  Category
>>> 0     0     0  0-0  very_large_number -2.147484e+09  0-1  0-2     0-2  0.2500    2.0       NaN
>>> 1     0     1  0-1               Leaf           NaN  NaN  NaN     NaN -0.0125    1.0       NaN
>>> 2     0     2  0-2               Leaf           NaN  NaN  NaN     NaN  0.0125    1.0       NaN

model.predict(pd.DataFrame({'very_large_number': [-10, 2*10**10, 3*10**10]}))
>>> [0.4875 0.4875 0.5125]

Note the split condition for feature very_large_number: -2.147484e+09 == -INT_MAX.

If we are to trust the output of trees_to_dataframe, the prediction for [-10, 2*10**10, 3*10**10] should be [0.5125, 0.5125, 0.5125], since each of those values is larger than -2.147484e+09. However, this is not what the model returns.

The same integer overflow error can be observed when using model.get_booster().get_dump().

However, when using model.save_model(file), the split conditions are saved correctly. This can be verified by saving and loading the model and observing identical predictions, which indicates there is no data loss in the process.

The trees_to_dataframe() function calls the underlying get_dump(), which in turn goes through XGBoosterDumpModelEx, whereas save_model() uses XGBoosterSaveModel. Thus the problem most likely lies in XGBoosterDumpModelEx.
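
A rough way to check both of these (sketch only; the JSON field names follow my reading of the model schema and are worth verifying against your xgboost version):

import json
import numpy as np
import pandas as pd
import xgboost as xgb

# Round trip: save and reload the model, then compare predictions.
model.save_model("model.json")
reloaded = xgb.XGBRegressor()
reloaded.load_model("model.json")

X_test = pd.DataFrame({'very_large_number': [-10, 2*10**10, 3*10**10]})
print(np.allclose(model.predict(X_test), reloaded.predict(X_test)))  # expect True

# Compare the raw thresholds stored in the saved JSON model with the text dump.
with open("model.json") as f:
    saved = json.load(f)
tree0 = saved["learner"]["gradient_booster"]["model"]["trees"][0]
print(tree0["split_conditions"])          # full-precision thresholds
print(model.get_booster().get_dump()[0])  # shows the overflowed -2.147484e+09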

This is my first time contributing to this project (or any open source project). Please let me know if there are any issues or if any more information is needed. Thanks!

trivialfis commented 7 months ago

Thank you for raising the issue, I'm currently on holiday and will look into it once the holiday is over.

bbernst commented 3 months ago

Hi, I've noticed something similar, so I'm adding my example here. Apologies in advance if this changes the topic of the issue too much; if so, I'll remove the comment and open a separate one.

In my case, get_dump() outputs different splits than trees_to_dataframe(). Below, the key difference is 0.2 coming from trees_to_dataframe() versus 0.200000003 coming from get_dump(). m.predict() matches trees_to_dataframe() and not get_dump(): if I manually recreate the model using the split values from get_dump(), I get the wrong values compared to m.predict().

code to reproduce:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

X = [
    [0.1, 0, 0.5, np.nan],
    [0.1, 1, 0.6, 2],
    [np.nan, 0, 0.7, 4],
    [0.3, 1, 0.8, 6],
    [0.2, 0, 0.9, np.nan],
    [0.2, np.nan, 1.0, 0]
] * 4
X = pd.DataFrame(X, columns=["A", "B", "C", "D"])
y = [0, 1, 0, 1, 1, 0] * 4

m = XGBClassifier(n_estimators=5, objective="binary:logistic", base_score=0.2)
_ = m.fit(X, y)
print(m.get_booster().get_dump(dump_format="text"))
print(m.get_booster().trees_to_dataframe())
['0:[B<1] yes=1,no=2,missing=1
    1:[A<0.200000003] yes=3,no=4,missing=3
      3:leaf=-0.210526332
      4:leaf=0.315789461
    2:leaf=0.842105329
', '0:[B<1] yes=1,no=2,missing=1
    1:[D<6] yes=3,no=4,missing=4
      3:leaf=-0.219102889
      4:leaf=0.297974169
    2:leaf=0.531206012
', '0:[B<1] yes=1,no=2,missing=1
    1:[D<6] yes=3,no=4,missing=4
      3:leaf=-0.197821245
      4:leaf=0.222074687
    2:leaf=0.402607888
',
'0:[B<1] yes=1,no=2,missing=1
    1:[A<0.200000003] yes=3,no=4,missing=3
        3:leaf=-0.205294877
        4:leaf=0.213468671
    2:leaf=0.331218809
', '0:[B<1] yes=1,no=2,missing=1
    1:[D<6] yes=3,no=4,missing=4
        3:leaf=-0.186532408
        4:leaf=0.165713936
    2:leaf=0.284410417
']

    Tree  Node   ID Feature  Split  Yes   No Missing      Gain     Cover  Category
0      0     0  0-0       B    1.0  0-1  0-2     0-1  7.433944  3.840000       NaN
1      0     1  0-1       A    0.2  0-3  0-4     0-3  3.469347  2.560000       NaN
2      0     2  0-2    Leaf    NaN  NaN  NaN     NaN  0.842105  1.280000       NaN
3      0     3  0-3    Leaf    NaN  NaN  NaN     NaN -0.210526  1.280000       NaN
4      0     4  0-4    Leaf    NaN  NaN  NaN     NaN  0.315789  1.280000       NaN
5      1     0  1-0       B    1.0  1-1  1-2     1-1  3.216151  4.500417       NaN
6      1     1  1-1       D    6.0  1-3  1-4     1-4  3.425155  2.641475       NaN
7      1     2  1-2    Leaf    NaN  NaN  NaN     NaN  0.531206  1.858942       NaN
8      1     3  1-3    Leaf    NaN  NaN  NaN     NaN -0.219103  1.320737       NaN
9      1     4  1-4    Leaf    NaN  NaN  NaN     NaN  0.297974  1.320737       NaN
10     2     0  2-0       B    1.0  2-1  2-2     2-1  1.933597  4.696602       NaN
11     2     1  2-1       D    6.0  2-3  2-4     2-4  2.273266  2.696686       NaN
12     2     2  2-2    Leaf    NaN  NaN  NaN     NaN  0.402608  1.999916       NaN
13     2     3  2-3    Leaf    NaN  NaN  NaN     NaN -0.197821  1.158573       NaN
14     2     4  2-4    Leaf    NaN  NaN  NaN     NaN  0.222075  1.538113       NaN
15     3     0  3-0       B    1.0  3-1  3-2     3-1  1.363356  4.629007       NaN
16     3     1  3-1       A    0.2  3-3  3-4     3-3  2.272249  2.703031       NaN
17     3     2  3-2    Leaf    NaN  NaN  NaN     NaN  0.331219  1.925976       NaN
18     3     3  3-3    Leaf    NaN  NaN  NaN     NaN -0.205295  1.173760       NaN
19     3     4  3-4    Leaf    NaN  NaN  NaN     NaN  0.213469  1.529271       NaN
20     4     0  4-0       B    1.0  4-1  4-2     4-1  1.037215  4.450341       NaN
21     4     1  4-1       D    6.0  4-3  4-4     4-4  1.586077  2.689200       NaN
22     4     2  4-2    Leaf    NaN  NaN  NaN     NaN  0.284410  1.761141       NaN
23     4     3  4-3    Leaf    NaN  NaN  NaN     NaN -0.186532  1.036968       NaN
24     4     4  4-4    Leaf    NaN  NaN  NaN     NaN  0.165714  1.652232       NaN

Holorite commented 3 months ago

I've noticed this too. If it helps, what I ended up doing is saving the model as ubj and then reading it back in with the Python module ubjson. I found that the binary format had better precision.

Then, since I had already written code to interpret the dataframe format, I just replaced the Split column I got from trees_to_dataframe() with the parsed data.
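
Roughly, the workaround looks like this (a sketch rather than my exact code; the model field names follow the JSON/UBJSON schema and are worth double-checking against your xgboost version):

import ubjson  # pip install py-ubjson

booster = model.get_booster()
booster.save_model("model.ubj")  # binary format keeps full precision

with open("model.ubj", "rb") as f:
    raw = ubjson.load(f)

# Per-tree split thresholds from the saved model, indexed by node id.
trees = raw["learner"]["gradient_booster"]["model"]["trees"]

# Overwrite the lossy Split column from trees_to_dataframe().
df = booster.trees_to_dataframe()
for t, tree in enumerate(trees):
    for node, cond in enumerate(tree["split_conditions"]):
        mask = (df["Tree"] == t) & (df["Node"] == node) & (df["Feature"] != "Leaf")
        df.loc[mask, "Split"] = cond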

bbernst commented 3 months ago

Yeah, interesting idea; it would be nice to get the functionality in the right spot to avoid that extra step. trees_to_dataframe does extra parsing after get_dump, calling float on the splits, whereas get_dump keeps them as strings. Maybe from_cstr_to_pystr (https://github.com/dmlc/xgboost/blob/4847f248402b813e6b797068f2145d28162a02d5/python-package/xgboost/core.py#L94) needs to be more sophisticated to handle this correctly?

trivialfis commented 3 months ago

It's a floating point serialization issue: for a lossless round trip between base 2 and base 10, one needs to control both the encoder and the decoder. That means we can have a lossless JSON model format, but not lossless visualization tools like Python dataframes, since for the former both encoder and decoder are defined in XGBoost, while for the latter the decoder (base 10 -> base 2) lives in Python.

The best way to get lossless dataframes is to bypass the text parsing and hence the base conversion.
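
As a small numpy illustration (assuming the booster stores its thresholds as 32-bit floats, which the 9-digit output of get_dump suggests):

import numpy as np

split_f32 = np.float32(0.2)        # nearest float32 to 0.2
print(f"{float(split_f32):.17g}")  # 0.20000000298023224

# A 9-significant-digit decimal, as printed by get_dump(), decodes back to the
# identical float32, so the text dump is lossless *if* the reader converts it
# back to float32.
print(np.float32("0.200000003") == split_f32)  # True

# Parsing the same text into a Python float (float64), as trees_to_dataframe()
# does, yields a slightly different number; only a cast back to float32
# recovers the exact threshold.
split_f64 = float("0.200000003")
print(split_f64 == float(split_f32))       # False
print(np.float32(split_f64) == split_f32)  # True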

bbernst commented 3 months ago

Maybe I'm reading it wrong, but I think the casting in Python is actually okay. The original value coming in is already different from what .predict() is using, as far as I can tell. If I print the data: CStrPptr argument in from_cstr_to_pystr, I see:

b'  { "nodeid": 0, "depth": 0, "split": "B", "split_condition": 1, "yes": 1, "no": 2, "missing": 1 , "children": [\n    { "nodeid": 1, "depth": 1, "split": "A", "split_condition": 0.200000003, "yes": 3, "no": 4, "missing": 3 , "children": [\n      { "nodeid": 3, "leaf": -0.210526332 }, \n      { "nodeid": 4, "leaf": 0.315789461 }\n    ]}, \n    { "nodeid": 2, "leaf": 0.842105329 }\n  ]}'

That split_condition should be 0.2, not 0.200000003, for the predicted value to match .predict().