Treelite gives different predictions than base XGBoost model

juliuscoburger commented 2 months ago

I noticed that my model returns different scores than the original model. I was able to boil the issue down to using a base_score during training. Can it be that this value is not being translated?

Code to replicate the issue:

import numpy as np
import xgboost as xgb
import treelite

np.random.seed(42)
N = 10
X = np.random.random((N, 10))
y = np.random.random((N,))
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({
    'objective': 'count:poisson'
}, dtrain, 10)
bst.save_model('/tmp/bst.json')
tl_model = treelite.frontend.load_xgboost_model('/tmp/bst.json')
# Treelite gives the same predictions as xgboost
np.testing.assert_almost_equal(treelite.gtil.predict(tl_model, data=X).squeeze(), bst.predict(dtrain))

# Poisson will fail for sufficiently high predictions, see https://github.com/dmlc/xgboost/issues/10486
y = np.random.random((N,)) * 3000
dtrain = xgb.DMatrix(X, label=y)
# But the issue can be mitigated by setting a sufficiently high base score
bst = xgb.train({
    'objective': 'count:poisson',
    'base_score': 3000
}, dtrain, 10)
bst.save_model('/tmp/bst.json')

tl_model = treelite.frontend.load_xgboost_model('/tmp/bst.json')
# Unfortunately, Treelite now gives different predictions
np.testing.assert_almost_equal(treelite.gtil.predict(tl_model, data=X).squeeze(), bst.predict(dtrain))

hcho3 commented 2 months ago

It looks like floating-point error is creeping in from two sources:

1. Use of large base_scores
2. The order of summation differs between XGBoost's predictor and Treelite GTIL

The check passes if you relax the required tolerance:

 np.testing.assert_almost_equal(treelite.gtil.predict(tl_model, data=X).squeeze(), bst.predict(dtrain), decimal=2)
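To illustrate why the default tolerance is so strict at this magnitude, here is a minimal sketch that does not use the model from this issue. It assumes the predictions are float32 values around 3000, where adjacent representable numbers are about 2.4e-4 apart, and the arrays `a` and `b` are made up for illustration. The default `decimal=7` asks for an absolute difference below 1.5e-7, which is finer than float32 can resolve here, so two values that differ by a single rounding step already fail. A relative tolerance via `np.testing.assert_allclose` may be a more natural fit for large predictions.

```python
import numpy as np

# Illustrative values only: a float32 "prediction" near 3000 and its
# nearest larger float32 neighbour (one rounding step away).
a = np.array([3000.0], dtype=np.float32)
b = np.nextafter(a, np.float32(np.inf))

# Gap between adjacent float32 values at this magnitude.
gap = float(b[0]) - float(a[0])

# The default check (decimal=7) needs |a - b| < 1.5e-7, so it fails.
default_check_failed = False
try:
    np.testing.assert_almost_equal(a, b)
except AssertionError:
    default_check_failed = True

# A relative tolerance scales with the magnitude of the values and passes.
np.testing.assert_allclose(a, b, rtol=1e-6)

print(gap, default_check_failed)
```

The `rtol=1e-6` value is an arbitrary choice for this sketch, not a recommended threshold for Treelite or XGBoost.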

juliuscoburger commented 2 months ago

The check is just there to showcase that the scores are not equal. I was under the impression that GTIL always returns the same scores.

hcho3 commented 2 months ago

> Can it be that this value is not being translated?

I double-checked, and base_scores is being properly translated and handled, so the discrepancy is not due to a logic error.

> I was under the impression that GTIL always returns the same scores.

GTIL may evaluate trees and leaf nodes in a different order than XGBoost. Floating-point addition is not associative (a + (b + c) != (a + b) + c in general), so rounding error can accumulate, especially when some values in the sum are much larger than the others, as in this example.
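The non-associativity can be reproduced in isolation with a small sketch. The numbers below are illustrative and are not taken from the model in this issue. The sketch assumes float32 arithmetic, with one large term (3000) and two small terms (1e-4) standing in for a large base value and small leaf contributions.

```python
import numpy as np

big = np.float32(3000.0)
small = np.float32(1e-4)

# Adjacent float32 values near 3000 are about 2.44e-4 apart, so adding
# 1e-4 (less than half that gap) rounds straight back to 3000 each time.
left = (big + small) + small

# Adding the small terms first gives 2e-4, which is more than half the
# gap, so the sum rounds up to the next representable float32.
right = big + (small + small)

print(left, right, left == right)
```

The two groupings add the same three numbers yet produce different float32 results, which is the same kind of effect that a different tree evaluation order can cause.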

To minimize error due to floating-point arithmetic, consider scaling the target by using StandardScaler.
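A minimal sketch of target scaling with scikit-learn's `StandardScaler` is shown below. It assumes targets on the same scale as the reproduction script and leaves the training step out entirely, so `pred_scaled` is only a placeholder for model output. One caveat to keep in mind: standardized targets take negative values, so they cannot be fed directly to `count:poisson`, which requires non-negative labels. This approach would fit an objective that accepts negative labels, such as `reg:squarederror`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
y = rng.random(10) * 3000  # targets with a large magnitude

# StandardScaler expects a 2-D array, so reshape the 1-D target.
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).ravel()

# ... train a model on y_scaled and predict here ...
pred_scaled = y_scaled  # placeholder standing in for model predictions

# Map predictions back to the original target scale.
pred = scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

print(y_scaled.mean(), y_scaled.std())
```

After scaling, the values being summed have zero mean and unit variance, which keeps the terms of the sum closer in magnitude.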

hcho3 commented 2 months ago

I'll probably have to add a note to the GTIL documentation about the possibility of floating-point error and how to mitigate it.

dmlc / treelite

Treelite gives different predictions than base XGBoost model #585