Elvenson / xgboost-go

XGBoost inference with Golang.

Failing test case for this library #12

Closed: AltamashRafiq closed this issue 11 months ago

AltamashRafiq commented 1 year ago

Hi! I have a failing test case for this library that I am trying to understand. It uses the open-source diamonds dataset. Could you weigh in on why this case fails and what might be going wrong? I am using XGBoost version 1.6.2, which I have verified dumps the same JSON trees and predictions as version 1.2.0.

Python file for generating the test files:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import dump_svmlight_file
from sklearn.model_selection import train_test_split

# build dataset
diamonds = pd.read_csv("diamonds.csv")
diamonds.price = 1 * (diamonds.price > 3000)

cuts = pd.get_dummies(diamonds.cut, drop_first=True)
cuts.columns = [f"cut_{c}" for c in cuts.columns.tolist()]
diamonds = diamonds.drop('cut', axis=1)
diamonds = pd.concat([diamonds, cuts], axis=1)

clarity = pd.get_dummies(diamonds.clarity, drop_first=True)
clarity.columns = [f"clarity_{c}" for c in clarity.columns.tolist()]
diamonds = diamonds.drop('clarity', axis=1)
diamonds = pd.concat([diamonds, clarity], axis=1)

color = pd.get_dummies(diamonds.color, drop_first=True)
color.columns = [f"color_{c}" for c in color.columns.tolist()]
diamonds = diamonds.drop('color', axis=1)
diamonds = pd.concat([diamonds, color], axis=1)

X, y = diamonds.drop('price', axis=1), diamonds[['price']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# dump test data
dump_svmlight_file(X_test.values, y_test.values.ravel(), 'diamonds_test.libsvm')

# passing model
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=2,
    n_jobs=-1,
)

model.fit(X_train.values, y_train.values.ravel())

bst = model.get_booster()
y_pred = bst.predict(xgb.DMatrix(X_test.values))

np.savetxt('passing_diamonds_true_prediction.txt', y_pred, delimiter='\t')
bst.dump_model('passing_diamonds_xgboost_dump.json', dump_format='json')

# failing model
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,
    n_jobs=-1,
)

model.fit(X_train.values, y_train.values.ravel())

bst = model.get_booster()
y_pred = bst.predict(xgb.DMatrix(X_test.values))

np.savetxt('failing_diamonds_true_prediction.txt', y_pred, delimiter='\t')
bst.dump_model('failing_diamonds_xgboost_dump.json', dump_format='json')

Test file in Go:

package main

import (
    "testing"

    xgb "github.com/Elvenson/xgboost-go"
    "github.com/Elvenson/xgboost-go/activation"
    "github.com/Elvenson/xgboost-go/mat"
    "gotest.tools/assert"
)

func TestEnsemble_Diamond_Passing(t *testing.T) {
    modelPath := "passing_diamonds_xgboost_dump.json"
    ensemble, err := xgb.LoadXGBoostFromJSON(modelPath,
        "", 1, 2, &activation.Logistic{})
    assert.NilError(t, err)

    inputPath := "diamonds_test.libsvm"
    input, err := mat.ReadLibsvmFileToSparseMatrix(inputPath)
    assert.NilError(t, err)

    predictions, err := ensemble.PredictProba(input)
    assert.NilError(t, err)

    expectedProbPath := "passing_diamonds_true_prediction.txt"
    expectedProb, err := mat.ReadCSVFileToDenseMatrix(expectedProbPath, "\t", 0.0)
    assert.NilError(t, err)

    err = mat.IsEqualMatrices(&predictions, &expectedProb, 0.0001)
    assert.NilError(t, err)

    // with undefined depth
    ensemble, err = xgb.LoadXGBoostFromJSON(modelPath,
        "", 1, 0, &activation.Logistic{})
    assert.NilError(t, err)

    predictions, err = ensemble.PredictProba(input)
    assert.NilError(t, err)

    err = mat.IsEqualMatrices(&predictions, &expectedProb, 0.0001)
    assert.NilError(t, err)
}

func TestEnsemble_Diamond_Failing(t *testing.T) {
    modelPath := "failing_diamonds_xgboost_dump.json"
    ensemble, err := xgb.LoadXGBoostFromJSON(modelPath,
        "", 1, 3, &activation.Logistic{})
    assert.NilError(t, err)

    inputPath := "diamonds_test.libsvm"
    input, err := mat.ReadLibsvmFileToSparseMatrix(inputPath)
    assert.NilError(t, err)

    predictions, err := ensemble.PredictProba(input)
    assert.NilError(t, err)

    expectedProbPath := "failing_diamonds_true_prediction.txt"
    expectedProb, err := mat.ReadCSVFileToDenseMatrix(expectedProbPath, "\t", 0.0)
    assert.NilError(t, err)

    err = mat.IsEqualMatrices(&predictions, &expectedProb, 0.0001)
    assert.NilError(t, err)

    // with undefined depth
    ensemble, err = xgb.LoadXGBoostFromJSON(modelPath,
        "", 1, 0, &activation.Logistic{})
    assert.NilError(t, err)

    predictions, err = ensemble.PredictProba(input)
    assert.NilError(t, err)

    err = mat.IsEqualMatrices(&predictions, &expectedProb, 0.0001)
    assert.NilError(t, err)
}
AltamashRafiq commented 1 year ago

For some context, there are two models here: one that passes the test and one that fails. The only difference is a max_depth of 2 in the passing model vs. 3 in the failing model.

Elvenson commented 1 year ago

Hi there, thank you for raising the issue. I will take a look when I have time.

AltamashRafiq commented 1 year ago

Thank you so much! I've been digging into this, and here is some insight for one sample observation.

Observation: 0 0:0.9 1:66.40000000000001 2:60 3:5.92 4:5.86 5:3.91 10:1 14:1 23:1

For this sample, only one tree in the forest gives an output that differs from Python's xgboost: the node below returns -0.302633077 in this library where Python's xgboost gives 0.0803299546. As we can see, it splits on feature 5, and the split condition is approximately equal to the raw value of the feature. Since 3.91 < 3.91000009 in float64, this library takes the "yes" branch to -0.302633077, while Python's xgboost behaves as though the value were greater than or equal to the threshold and goes to 0.0803299546. My suspicion is that the threshold 3.91000009 is being accidentally rounded somewhere.

Problematic split node: { "nodeid": 5, "depth": 2, "split": "f5", "split_condition": 3.91000009, "yes": 11, "no": 12, "missing": 11 , "children": [{ "nodeid": 11, "leaf": -0.302633077 }, { "nodeid": 12, "leaf": 0.0803299546 }]}
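
To make the routing explicit, here is a minimal sketch in Go (illustrative only, not this library's actual code) of how I read the dump semantics for this node, where the "yes" child is taken when the feature value is strictly less than split_condition:

package main

import "fmt"

func main() {
    // Values taken from the dump and the observation above.
    splitCondition := 3.91000009 // parsed from the JSON dump as float64
    f5 := 3.91                   // raw value of feature 5 for this observation

    // Dump semantics: take the "yes" child when featureValue < split_condition.
    if f5 < splitCondition {
        fmt.Println("yes -> nodeid 11, leaf -0.302633077") // the float64 comparison lands here
    } else {
        fmt.Println("no  -> nodeid 12, leaf 0.0803299546") // the branch Python's xgboost takes
    }
}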

AltamashRafiq commented 1 year ago

This issue is solved! It turns out the problem originates from xgboost in Python, not from the xgboost-go library. XGBoost's C++ library internally stores the trees as float32, but Python dumps them as float64. This means split conditions like 3.91 become 3.91000009 when dumped to JSON, eventually leading to differences between this library's results and Python xgboost's. The problem can be replicated in numpy as well (see the attached screenshot).
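
The same widening can also be reproduced in Go with just the standard library; a minimal sketch (the printed values follow from IEEE-754 float32/float64 semantics):

package main

import "fmt"

func main() {
    // XGBoost's C++ core stores thresholds as float32; the JSON dump widens
    // them to float64, which is how 3.91 turns into 3.91000009.
    widened := float64(float32(3.91))
    fmt.Println(widened) // 3.9100000858306885, printed as 3.91000009 in the dump

    // Comparing in float64 vs float32 flips the branch for this observation:
    feature := 3.91
    fmt.Println(feature < widened)                   // true:  float64 compare, the "yes" branch
    fmt.Println(float32(feature) < float32(widened)) // false: float32 compare, matching XGBoost's C++, the "no" branch
}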

jcronq commented 1 year ago

XGBoost's C++ library uses float32 for all values. It may be worth considering using float32 instead of float64 within xgboost-go for inference parity.

I'm not aware of any issue with casting the float64 values in the JSON to float32 in code. In theory, it should reverse the damage done by casting float32 to float64 in the Python xgboost lib.
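
Something along these lines, as a rough sketch (goesLeft and its parameter names are hypothetical, not the library's actual API):

// Rough sketch: narrow both sides back to float32 before comparing, so
// inference mirrors XGBoost's C++ internals. The feature value and the
// threshold both arrive as float64 (from input parsing and the JSON dump).
func goesLeft(featureValue, splitCondition float64) bool {
    return float32(featureValue) < float32(splitCondition)
}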

@Elvenson Thoughts?

Elvenson commented 1 year ago

I think you guys are right; let me update the code when I have time. Thank you both for your insights!

AltamashRafiq commented 11 months ago

@Elvenson @jcronq PR for this change created here: https://github.com/Elvenson/xgboost-go/pull/13

Elvenson commented 11 months ago

Thanks for your contribution, @AltamashRafiq! I have merged the code and will close this issue now.