dmlc / tl2cgen

TL2cgen (TreeLite 2 C GENerator) is a model compiler for decision tree models
https://tl2cgen.readthedocs.io/en/latest/
Apache License 2.0

xgboost output not matching #6

Open kumarsameer opened 1 year ago

kumarsameer commented 1 year ago
import pandas as pd
import xgboost as xgb
import treelite
from treelite_runtime import Predictor, DMatrix

if True:
    # score today's feature file with the in-memory model
    # (createTrainDf, train_cols, today, and the trained model come from my pipeline)
    filename = "ind/test/" + 'bnifty' + '_' + str(today) + ".ind.ade"
    adf = pd.read_csv(filename, index_col=False)
    tdf = createTrainDf(adf)
    dtrain = xgb.DMatrix(data=tdf[train_cols])
    y_pred = model.predict(dtrain, output_margin=True)
    print("live :", y_pred)

if True:
    # round-trip through save_model/load_model and predict again
    model.save_model("model.bin")
    bst = xgb.Booster()
    bst.load_model('model.bin')
    print("bin :", bst.predict(dtrain, output_margin=True))

if True:
    # compile the same model with Treelite and compare the margins
    model_path = "./tlite2"
    bst_lite = treelite.Model.from_xgboost(model)
    bst_lite.export_lib(toolchain='gcc', params={}, libpath=model_path, verbose=False)
    predictor = Predictor(model_path, verbose=False)
    dmat = DMatrix(tdf[train_cols])
    print("treelite :", predictor.predict(dmat, pred_margin=True))

output:

live : [-1.0892507  -0.2221069  -0.28209284  0.01806552  0.16996847 -0.9235258 ]
bin : [-1.0892507  -0.2221069  -0.28209284  0.01806552  0.16996847 -0.9235258 ]
treelite : [-0.08823351 -0.2221069  -0.3661577   0.01806552  0.30945933 -0.9235258 ]

Every alternate margin output matches.
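A quick way to see exactly which rows disagree, reusing the arrays from the snippet above (just a sketch):

import numpy as np

xgb_margin = bst.predict(dtrain, output_margin=True)
tl_margin = predictor.predict(dmat, pred_margin=True)
mismatch = ~np.isclose(xgb_margin, tl_margin, atol=1e-6)
print(np.where(mismatch)[0])  # indices of the rows whose margins differ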

kumarsameer commented 1 year ago

I also compiled against the XGBoost C API, and its margin output matches correctly.
I have verified the column names in the XGBoost dump against Treelite's main.c file, and the column names match correctly.

Any idea what the issue could be is greatly appreciated.

kumarsameer commented 1 year ago

@hcho3 I have looked at a similar issue you commented on and tried everything. Could you throw some light on this?

I am using the following package versions:

treelite==3.1.0.dev0
treelite-runtime==3.1.0
xgboost==1.7.3

hcho3 commented 1 year ago

@kumarsameer Can you post your XGBoost model here? I'll try to debug the issue.

kumarsameer commented 1 year ago

Please find the replication example and model file (zip) attached.

import xgboost as xgb
import treelite
import treelite_runtime
import numpy as np

test_data = [-1.62e+02,  3.63e+01,  1.00e+01,  6.00e+00,  1.10e+01,  7.00e+00,
         1.00e+00,  3.00e+00,  0.00e+00,  2.00e+00, -1.00e+00,  3.30e+01,
         7.70e+01,  3.90e+01,  1.11e+01,  1.30e+01, -9.00e+00,  0.00e+00,
        -6.80e+01,  5.40e+01,  1.09e+02, -7.00e-01, -7.00e-01, -3.00e+00,
         9.10e+01, -9.00e+00, -2.00e+00,  1.10e+01,  9.00e+00,  1.50e+01,
        -1.20e+01, -1.80e+01, -6.20e+01,  6.55e+01,  4.50e+01,  5.90e+01,
         1.00e-01, -2.00e+00,  3.00e+00,  1.80e+01, -6.00e+00, -1.80e+01,
        -1.00e+00,  1.00e-01, -2.40e+00,  0.00e+00,  0.00e+00,  0.00e+00,
         0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,
         0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,
         0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,  0.00e+00]

modelfile = 'model9.bin'

# load the saved XGBoost model with Treelite and compile it to a shared library
model = treelite.Model.load(modelfile, model_format='xgboost')
model.export_lib('gcc', 'compiled.dylib', params={'parallel_comp': model.num_tree}, verbose=False)
predictor = treelite_runtime.Predictor(libpath='./compiled.dylib')
dmat = treelite_runtime.DMatrix(test_data)
print(f'Treelite: {predictor.predict(dmat, pred_margin=True)}')

# load the same model natively in XGBoost and predict the same row
bst = xgb.Booster()
bst.load_model(modelfile)
dtrain = xgb.DMatrix(data=np.expand_dims(test_data, 0))
print("bin :", bst.predict(dtrain, output_margin=True))

This is the output I get on my machine

Treelite: -0.3277367353439331
bin : [0.5290272]

model.zip

kumarsameer commented 1 year ago

The output matches if test_data = list(np.ones(65)).
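i.e., rerunning the snippet above with an all-ones row instead of test_data (sketch):

ones_row = np.ones((1, 65))
print(predictor.predict(treelite_runtime.DMatrix(ones_row), pred_margin=True))
print(bst.predict(xgb.DMatrix(ones_row), output_margin=True))
# both print the same margin for this row, so the mismatch is data-dependent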

kumarsameer commented 1 year ago

@hcho3 Let me know if you need any help replicating the issue.

kumarsameer commented 1 year ago

Not sure if it helps, but one thing I noticed is that the C++ XGBoost API also matched only after I set the missing value to std::numeric_limits<double>::quiet_NaN().
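What I mean on the Python side, reusing bst and test_data from my replication example above (just a sketch, assuming the difference comes down to whether 0.0 or NaN is treated as missing):

import numpy as np
import xgboost as xgb

row = np.expand_dims(test_data, 0)

# default: only NaN is treated as missing
print(bst.predict(xgb.DMatrix(row, missing=np.nan), output_margin=True))

# if 0.0 is treated as missing instead, the many zero-valued features in this row
# follow the default branches, which could explain the margin difference
print(bst.predict(xgb.DMatrix(row, missing=0.0), output_margin=True))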

mchonofsky commented 1 year ago

Hi y'all,

Seeing the same issue here. Attaching a CSV of data and labels. The predictions for the first twenty rows are below - XGB at left, Treelite at right. I'm running Treelite version 3.1.0. I attached the XGBoost JSON and generated C code in the zip file.

models.zip train_labels.csv train_data.csv

XGB  Treelite
0.0  0.15592413
0.0  0.7200409
0.0  0.7200409
0.0  0.9868025
1.0  0.9695979
0.0  0.9695979
0.0  0.90683216
1.0  0.90683216
1.0  0.25082284
1.0  0.6561131
1.0  0.6561131
1.0  0.33845404
1.0  0.33845404
1.0  0.33845404
1.0  0.7200409
1.0  0.0045186426

mchonofsky commented 1 year ago

The model was built in XGBoost with

param = {'max_depth': 6,
 'eta': 0.3,
 'tree_method': 'hist',
 'objective': 'binary:hinge',
 'eval_metric': ['logloss', 'error']}

mchonofsky commented 1 year ago

And here's the full replication code:

import pandas as pd, numpy as np, treelite, treelite_runtime, xgboost as xgb
from importlib import reload
reload(treelite_runtime)
reload(treelite)

# train a binary:hinge model on the attached data
X = pd.read_csv('train_data.csv').to_numpy()[:, 1:]
y = pd.read_csv('train_labels.csv').to_numpy()[:, 1]
dtrain = xgb.DMatrix(X, label=y)
param = {'max_depth': 6, 'eta': 0.3, 'tree_method': 'hist', 'objective': 'binary:hinge', 'eval_metric': ['logloss', 'error']}
bst = xgb.train(param, dtrain, 10, [(dtrain, 'train')])

# compile the same model with Treelite
model = treelite.Model.from_xgboost(bst)
model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True)
preds = bst.predict(xgb.DMatrix(X[0:20, :]))
predictor = treelite_runtime.Predictor('./mymodel.so')

# these should match
for i in range(20):
    print(preds[i], predictor.predict(treelite_runtime.DMatrix(X[i:i+1, :])))
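Since binary:hinge makes bst.predict() return hard 0/1 labels (1 when the raw margin is positive), a like-for-like check is to compare the raw margins on both sides; a sketch of what I mean, reusing the objects above:

import numpy as np

xgb_margin = bst.predict(xgb.DMatrix(X[0:20, :]), output_margin=True)
tl_margin = predictor.predict(treelite_runtime.DMatrix(X[0:20, :]), pred_margin=True)
print(np.allclose(xgb_margin, tl_margin))   # do the trees themselves agree?
print((tl_margin > 0).astype(float))        # hinge step applied to Treelite's margins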

davidelahoz commented 2 months ago

I'm facing the same issue. Did you find any workaround?