Holorite opened this issue 9 months ago
Thank you for raising the issue, I'm currently on holiday and will look into it once the holiday is over.
Hi, I've noticed something similar, so I'm adding it here with my example. Apologies in advance if this changes the topic of the issue too much; if so, I'll remove the comment and open a different one.
In my case, I'm noticing that `get_dump()` outputs different splits than `trees_to_dataframe()`. Below, the key difference is the split `0.2` coming from `trees_to_dataframe()` versus `0.200000003` coming from `get_dump()`. And `m.predict()` matches `trees_to_dataframe()`, not `get_dump()`: if I manually recreate the model using the split values from `get_dump()`, I get the wrong values compared to `m.predict()`.
Code to reproduce:

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

X = [
    [0.1, 0, 0.5, np.nan],
    [0.1, 1, 0.6, 2],
    [np.nan, 0, 0.7, 4],
    [0.3, 1, 0.8, 6],
    [0.2, 0, 0.9, np.nan],
    [0.2, np.nan, 1.0, 0],
] * 4
X = pd.DataFrame(X, columns=["A", "B", "C", "D"])
y = [0, 1, 0, 1, 1, 0] * 4

m = XGBClassifier(n_estimators=5, objective="binary:logistic", base_score=0.2)
_ = m.fit(X, y)

print(m.get_booster().get_dump(dump_format="text"))
print(m.get_booster().trees_to_dataframe())
```
```
['0:[B<1] yes=1,no=2,missing=1
1:[A<0.200000003] yes=3,no=4,missing=3
3:leaf=-0.210526332
4:leaf=0.315789461
2:leaf=0.842105329
', '0:[B<1] yes=1,no=2,missing=1
1:[D<6] yes=3,no=4,missing=4
3:leaf=-0.219102889
4:leaf=0.297974169
2:leaf=0.531206012
', '0:[B<1] yes=1,no=2,missing=1
1:[D<6] yes=3,no=4,missing=4
3:leaf=-0.197821245
4:leaf=0.222074687
2:leaf=0.402607888
', '0:[B<1] yes=1,no=2,missing=1
1:[A<0.200000003] yes=3,no=4,missing=3
3:leaf=-0.205294877
4:leaf=0.213468671
2:leaf=0.331218809
', '0:[B<1] yes=1,no=2,missing=1
1:[D<6] yes=3,no=4,missing=4
3:leaf=-0.186532408
4:leaf=0.165713936
2:leaf=0.284410417
']
    Tree  Node   ID Feature  Split  Yes   No Missing      Gain     Cover Category
0      0     0  0-0       B    1.0  0-1  0-2     0-1  7.433944  3.840000      NaN
1      0     1  0-1       A    0.2  0-3  0-4     0-3  3.469347  2.560000      NaN
2      0     2  0-2    Leaf    NaN  NaN  NaN     NaN  0.842105  1.280000      NaN
3      0     3  0-3    Leaf    NaN  NaN  NaN     NaN -0.210526  1.280000      NaN
4      0     4  0-4    Leaf    NaN  NaN  NaN     NaN  0.315789  1.280000      NaN
5      1     0  1-0       B    1.0  1-1  1-2     1-1  3.216151  4.500417      NaN
6      1     1  1-1       D    6.0  1-3  1-4     1-4  3.425155  2.641475      NaN
7      1     2  1-2    Leaf    NaN  NaN  NaN     NaN  0.531206  1.858942      NaN
8      1     3  1-3    Leaf    NaN  NaN  NaN     NaN -0.219103  1.320737      NaN
9      1     4  1-4    Leaf    NaN  NaN  NaN     NaN  0.297974  1.320737      NaN
10     2     0  2-0       B    1.0  2-1  2-2     2-1  1.933597  4.696602      NaN
11     2     1  2-1       D    6.0  2-3  2-4     2-4  2.273266  2.696686      NaN
12     2     2  2-2    Leaf    NaN  NaN  NaN     NaN  0.402608  1.999916      NaN
13     2     3  2-3    Leaf    NaN  NaN  NaN     NaN -0.197821  1.158573      NaN
14     2     4  2-4    Leaf    NaN  NaN  NaN     NaN  0.222075  1.538113      NaN
15     3     0  3-0       B    1.0  3-1  3-2     3-1  1.363356  4.629007      NaN
16     3     1  3-1       A    0.2  3-3  3-4     3-3  2.272249  2.703031      NaN
17     3     2  3-2    Leaf    NaN  NaN  NaN     NaN  0.331219  1.925976      NaN
18     3     3  3-3    Leaf    NaN  NaN  NaN     NaN -0.205295  1.173760      NaN
19     3     4  3-4    Leaf    NaN  NaN  NaN     NaN  0.213469  1.529271      NaN
20     4     0  4-0       B    1.0  4-1  4-2     4-1  1.037215  4.450341      NaN
21     4     1  4-1       D    6.0  4-3  4-4     4-4  1.586077  2.689200      NaN
22     4     2  4-2    Leaf    NaN  NaN  NaN     NaN  0.284410  1.761141      NaN
23     4     3  4-3    Leaf    NaN  NaN  NaN     NaN -0.186532  1.036968      NaN
24     4     4  4-4    Leaf    NaN  NaN  NaN     NaN  0.165714  1.652232      NaN
```
I've noticed this too. If it helps, what I ended up doing is saving the model as `ubj` and then reading it back in using the Python module `ubjson`. I found that the binary format had better precision. Then, since I had already written code to interpret the dataframe format, I just replaced the `Split` column that I got from `trees_to_dataframe()` with the parsed data.
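If it's useful, a minimal sketch of that workaround (assuming the `py-ubjson` package, the classifier `m` from the snippet above as a stand-in for your own model, and the standard model layout with trees under `learner/gradient_booster/model/trees`):

```python
import ubjson  # from the py-ubjson package

m.save_model("model.ubj")  # write the model in binary UBJSON format
with open("model.ubj", "rb") as f:
    model = ubjson.load(f)

# Assumed layout: the UBJSON document mirrors the JSON model schema, with
# per-tree split values stored as binary floats (no decimal round trip).
trees = model["learner"]["gradient_booster"]["model"]["trees"]
split_conditions = [t["split_conditions"] for t in trees]
```

The `split_conditions` list can then be joined back onto the `Split` column of the dataframe by tree and node id.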
Yeah, interesting idea. It would be nice to get the functionality in the right spot to avoid that extra step. `trees_to_dataframe` does extra parsing after `get_dump`, calling `float()` on the splits, whereas `get_dump` leaves them as strings. Maybe `from_cstr_to_pystr` (https://github.com/dmlc/xgboost/blob/4847f248402b813e6b797068f2145d28162a02d5/python-package/xgboost/core.py#L94) needs to be more sophisticated to handle it correctly?
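To make the parsing step concrete (a simplified view, not the actual library code): the dump carries the split as a decimal string, `trees_to_dataframe` converts it with `float()`, and pandas then rounds it for display, which is presumably why the dataframe shows `0.2`:

```python
import pandas as pd

s = "0.200000003"  # split condition exactly as it appears in the dump text
v = float(s)       # trees_to_dataframe parses the string into a Python float
print(v)                             # 0.200000003
print(pd.DataFrame({"Split": [v]}))  # shown as 0.2 at pandas' default precision
```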
It's a floating point serialization issue. For a lossless round trip between base 2 and base 10, one needs to control both the encoder and the decoder. That means we can have a lossless JSON model format but not lossless visualization tools like Python dataframes: for the former, both encoder and decoder are defined in XGBoost, while the latter has its decoder (base 10 -> base 2) in Python.
The best way to have lossless dataframes is to bypass the text parsing and hence the base conversion.
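For anyone needing exact splits today, a sketch that bypasses the dump text by reading them from the model serialization instead (assuming `save_raw(raw_format="json")`, available in recent XGBoost versions, and the standard JSON model layout, which may vary by version):

```python
import json

# The model JSON is serialized by XGBoost itself with round-trippable floats.
raw = m.get_booster().save_raw(raw_format="json")
model = json.loads(bytes(raw))

# Split conditions as stored in the model, not as rendered by the dump code.
trees = model["learner"]["gradient_booster"]["model"]["trees"]
split_conditions = [t["split_conditions"] for t in trees]
```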
Maybe I'm reading it wrong, but I think the casting in Python is actually okay. The original value coming in is already different from what `.predict()` is using, as far as I can tell. If I print `data: CStrPptr` in `from_cstr_to_pystr`, I see:

```
b' { "nodeid": 0, "depth": 0, "split": "B", "split_condition": 1, "yes": 1, "no": 2, "missing": 1 , "children": [\n { "nodeid": 1, "depth": 1, "split": "A", "split_condition": 0.200000003, "yes": 3, "no": 4, "missing": 3 , "children": [\n { "nodeid": 3, "leaf": -0.210526332 }, \n { "nodeid": 4, "leaf": 0.315789461 }\n ]}, \n { "nodeid": 2, "leaf": 0.842105329 }\n ]}'
```

That `split_condition` should be `0.2`, not `0.200000003`, for a manually recreated model to match `.predict()`.
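For what it's worth, the two renderings appear to denote the same float32 value, i.e. `0.200000003` is just the 9-digit decimal expansion of the float32 nearest to `0.2`; a quick check:

```python
import numpy as np

# The nearest float32 to 0.2 prints as 0.200000003 at 9 significant digits,
# so both decimal strings decode to the same float32 split.
print(np.float32(0.2) == np.float32(0.200000003))  # True
print(float(np.float32(0.2)))                      # 0.20000000298023224
```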
Dumping tree information using `get_dump` and `trees_to_dataframe` in Python for models with large split conditions causes integer overflow errors. This has only been tested with Python regression models on v2.0.3.

How to reproduce

In Python:
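A minimal sketch of the kind of model that triggers this, with hypothetical data chosen so the learned split lies far outside the int32 range:

```python
import pandas as pd
from xgboost import XGBRegressor

# Hypothetical data: the only useful split on `very_large_number` lies far
# outside the int32 range.
X = pd.DataFrame({"very_large_number": [-10.0, 2 * 10**10, 3 * 10**10] * 10})
y = [0.0, 1.0, 1.0] * 10

model = XGBRegressor(n_estimators=1, max_depth=1)
model.fit(X, y)

# Per the report, the Split column renders as -2.147484e+09 on v2.0.3,
# even though the learned split lies between -10 and 2e10.
print(model.get_booster().trees_to_dataframe())
print(model.predict(X.iloc[:3]))
```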
Note the split condition for feature `very_large_number`: `-2.147484e+09 == -INT_MAX`.

If we are to trust the output of `trees_to_dataframe`, each prediction for `[-10, 2*10**10, 3*10**10]` should return `[0.5125, 0.5125, 0.5125]`, since each of those values is larger than `-2.147484e+09`. However, this is not what is observed from the model. The same integer overflow error can be observed when using `model.get_booster().get_dump()`.

However, when using `model.save_model(file)`, the split conditions are saved correctly. This can be tested by saving and loading the model, in which case you will observe identical model outputs, indicating there is no data loss in the process.
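Continuing from the sketch above, a hedged version of that save/load check (file name hypothetical):

```python
import numpy as np
from xgboost import XGBRegressor

model.save_model("model.json")  # hypothetical file name

m2 = XGBRegressor()
m2.load_model("model.json")

# Identical predictions after the round trip: save_model preserves the
# split conditions, unlike the text dump path.
assert np.allclose(model.predict(X), m2.predict(X))
```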
The `trees_to_dataframe()` function calls the underlying `get_dump()`, which in turn accesses `XGBoosterDumpModelEx`. However, `save_model()` uses `XGBoosterSaveModel`. Thus, it seems like the problem is likely with `XGBoosterDumpModelEx`.

This is my first time contributing to this project (or any open source project). Please let me know if there are any issues or if any more information is needed. Thanks!