dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Why does the regression model produced by XGBoost depend on the order of the training data when more than 8194 data points are used? #10834

Open 6nc0r6-1mp6r0 opened 1 week ago

6nc0r6-1mp6r0 commented 1 week ago

When I use XGBRegressor to construct a boosted tree model from 8194 or fewer data points (i.e., n_train $\leq$ 8194, where n_train is defined in the code below) and randomly shuffle the data points before training, the fit method is order independent, meaning that it generates the same predictive model each time that it is called. However, when I do the same for 8195 data points, fit is order dependent -- it generates a different predictive model for each call. Why is this?

I have read this paper on XGBoost and nearly all of the XGBoost documentation, and the non-subsampling algorithms described in both appear to be order independent for all n_train. So the source of the order dependence for large-n_train datasets is the mysterious part.

Below is a minimal Python script that illustrates the issue.

import numpy as np
from numpy.random import seed, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from copy import deepcopy
from xgboost import XGBRegressor

M = 2  # number of models to compare
ep = 10**-9  # used in a test of the effect of nonassociativity of floating-point addition
n_disp = 5  # number of elements of y_test_pred[m] to display
tree_method = 'approx'  # tree_method of XGBRegressor.  Also try 'hist' and 'exact'.
n_jobs = 1  # number of parallel threads used to run XGBoost
np.set_printoptions(precision=5, linewidth=1000, suppress=True)
seed(42)

# ------------------------------------------------------------------------------------------
def main_func():

    # Construct X and y
    X, y = make_regression(n_samples=10244)

    # Split X and y for training and testing
    X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2)
    n_feat = X_train0[0].shape[0]  # number of features

    for remove_last_datapt in [1, 0]:

        # Remove last data point of training dataset if requested and compute n_train
        if remove_last_datapt == 1:
            X_train1 = X_train0[:-1]
            y_train1 = y_train0[:-1]
        else:
            X_train1 = X_train0
            y_train1 = y_train0
        n_train = y_train1.shape[0]

        model = M * [None]
        y_test_pred = M * [None]
        for m in range(M):

            # Add noise to X_train1 and y_train1 to gauge whether nonassociativity of floating-point
            # addition might be causing order dependence
            X_train = deepcopy(X_train1)
            y_train = deepcopy(y_train1)
            for k in range(n_train):
                X_train[k] = X_train1[k] + uniform(low=-ep, high=ep, size=(n_feat,))
            y_train = y_train1 + uniform(low=-ep, high=ep, size=(n_train,))

            # Train the model and use it to predict y_test
            model[m] = train_model(n_train, X_train, y_train, X_test, y_test)
            y_test_pred[m] = model[m].predict(X_test)
            print('---')
            print(f'n_train = {n_train}')
            print(f'y_test_pred[m][:{n_disp}] for m = {m}:')
            print(y_test_pred[m][:n_disp])

# ------------------------------------------------------------------------------------------
def train_model(n_train, X_train, y_train, X_test, y_test):

    # Permute X_train and y_train
    p = np.random.permutation(n_train)
    # p = np.random.permutation(n_train-1)
    # p = np.append(p, n_train-1)
    X_train = X_train[p]
    y_train = y_train[p]

    # Construct and train the model
    model = XGBRegressor(tree_method=tree_method, nthread=n_jobs, random_state=42)
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)

    return model

# ------------------------------------------------------------------------------------------
main_func()

One run of this code yields

---
n_train = 8194
y_test_pred[m][:5] for m = 0:
[-226.1524   -26.02231   58.33525  284.14474 -152.88277]
---
n_train = 8194
y_test_pred[m][:5] for m = 1:
[-226.1524   -26.02231   58.33525  284.14474 -152.88277]
---
n_train = 8195
y_test_pred[m][:5] for m = 0:
[-274.84207  -80.22304   44.25071  253.39555 -102.62756]
---
n_train = 8195
y_test_pred[m][:5] for m = 1:
[-145.69212  -41.92562   68.96463  279.14572  -78.55302]

Note that for n_train = 8194, y_test_pred[m][:n_disp] is the same for all m, but for n_train = 8195 it is not.

Within the script, observe that I permute the elements of X_train and y_train before each run. I would expect this to have no effect on the model produced by the fitting algorithm given that, to my understanding, the feature values are sorted and binned near the start of the algorithm. However, if I comment out this permutation, the high-n_train order dependence of the algorithm disappears. Also note that within the XGBRegressor call, tree_method can be set to 'approx', 'hist', or 'auto' and random_state can be set to a fixed value without eliminating the order dependence at large n_train.
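
As an aside, a compact way to check whether two fits are truly identical, rather than comparing a handful of predictions by eye, is sketched below; model_a, model_b, and X_check are placeholder names for two regressors fitted as above and a matrix of test features.

import numpy as np

def fits_identical(model_a, model_b, X_check):
    # Byte-for-byte comparison of the two trained boosters; identical models
    # should serialize to identical bytes.
    same_model = model_a.get_booster().save_raw() == model_b.get_booster().save_raw()
    # Element-wise comparison of the predictions on the same test matrix.
    same_preds = np.array_equal(model_a.predict(X_check), model_b.predict(X_check))
    return same_model, same_preds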

Finally, there are several comments in the XGBoost documentation that might initially seem relevant:

For various reasons, however, I suspect that these notes are either unrelated to the abrupt transition to order dependence that I have just described or inadequate to explain it.

Any clarity would be appreciated!

maxaehle commented 4 days ago

Can confirm the finding (thanks for the very detailed report!), and made the following additional observations:

trivialfis commented 4 days ago

Yes, XGBoost is order-dependent when there is more than 1 thread. This is caused by floating point summation not being associative:

(a + b) + c != a + (b + c)

In a parallel environment, we split the data into chunks for multiple threads to consume:

# single thread
res = a + b + c + d
# 2 threads
thread0 = a + b
thread1 = c + d
res = thread0 + thread1
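
A tiny runnable illustration of both points (plain Python, not XGBoost code): the grouping of additions changes the rounded result, so splitting a reduction across threads and combining the partial sums can give a slightly different total than a single serial pass.

# Non-associativity of floating-point addition.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6

# The same effect with a "2-thread" style split of a reduction.
data = [0.1] * 10
serial = 0.0
for v in data:
    serial += v                     # single accumulator, left to right
thread0 = sum(data[:5])             # each "thread" sums its own chunk
thread1 = sum(data[5:])
print(serial, thread0 + thread1)    # 0.9999999999999999 1.0
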
trivialfis commented 4 days ago

Even with a single thread, it's still order-dependent. The issue is just whether the floating point error can accumulate to a point where it can affect the tree splits.
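
To make the single-thread case concrete, here is a small sketch (my own, not XGBoost code) that accumulates the same 8195 single-precision values in two different orders; the two running totals typically disagree in the last few bits, and whether such a disagreement matters depends on whether it is ever large enough to flip a split decision.

import numpy as np

rng = np.random.default_rng(42)
g = rng.normal(size=8195).astype(np.float32)  # stand-in for per-sample gradient values

total_a = np.float32(0.0)
for v in g:                                   # accumulate in the original order
    total_a += v

total_b = np.float32(0.0)
for v in g[rng.permutation(g.size)]:          # accumulate in a permuted order
    total_b += v

print(total_a, total_b, total_a == total_b)   # the totals usually differ slightly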

6nc0r6-1mp6r0 commented 2 days ago

Thanks @maxaehle for the good ideas. I can reproduce the first observation, but not the second; when I set n_jobs = 1, the output predictions for 8195 are still quite different on my machine. I have updated my code above to indicate how I tested each observation. (Hopefully I didn't do anything silly.)

@trivialfis, I appreciate the clarifications. However, I am skeptical that the nonassociativity of floating-point summation can explain the major transition in order dependence that I have described:

The threshold value 8194 is very nearly $2^{13} = 8192$, and the block size given in the XGBoost paper cited above is 8 times this: $2^{16} = 65,536$. Could the issue have to do with the use of blocks? Broadly, I am finding it hard to imagine an implementation of XGBoost that naturally or incidentally satisfies both (i) order dependence if and only if n_train > 8194 and (ii) maxaehle's first observation. Certainly a puzzle!

trivialfis commented 2 days ago

The paper is a bit outdated with respect to the implementation.

Floating point is a main source of such issues, but there is another source. The quantile sketching works on a stream of data and prunes the summary as more input comes in. In that case, the pruning results can depend on the arrival order of the data. You can try to verify this by using QuantileDMatrix with get_quantile_cut. If quantile is not the issue, then it's very likely just floating-point error; the gain calculation is very sensitive there.
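
As a concrete sketch of this suggestion (assuming a recent XGBoost release that provides QuantileDMatrix and DMatrix.get_quantile_cut(), and reusing the X_train and y_train arrays from the scripts above), one could compare the generated cuts across two different row orders:

import numpy as np
import xgboost as xgb

def cuts_for(X, y, seed):
    # Permute the rows, build a QuantileDMatrix, and return its quantile cuts
    # (a pair of arrays: per-feature cut pointers and the cut values).
    p = np.random.default_rng(seed).permutation(X.shape[0])
    qdm = xgb.QuantileDMatrix(X[p], label=y[p])
    return qdm.get_quantile_cut()

indptr_a, values_a = cuts_for(X_train, y_train, seed=0)
indptr_b, values_b = cuts_for(X_train, y_train, seed=1)
print(np.array_equal(indptr_a, indptr_b), np.array_equal(values_a, values_b))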

6nc0r6-1mp6r0 commented 1 day ago

Is the pruning step of the quantile sketching algorithm used only when n_train > 8194, or does it become significantly more order dependent above this threshold? If either is the case, then this sounds like it could be the explanation.

The verification that you mention using get_quantile_cut() is below.

import numpy as np
from numpy.random import seed
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import xgboost as xgb

M = 2  # number of models to compare
n_disp = 5  # number of elements of y_test_pred[m] to display
np.set_printoptions(precision=5, linewidth=1000, suppress=True)
seed(42)

# ------------------------------------------------------------------------------------------
def main_func():

    # Construct X and y
    X, y = make_regression(n_samples=10244)

    # Split X and y for training and testing
    X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2)
    n_feat = X_train0[0].shape[0]  # number of features

    for remove_last_datapt in [1, 0]:

        # Remove last data point of training dataset if requested and compute n_train
        if remove_last_datapt == 1:
            X_train = X_train0[:-1]
            y_train = y_train0[:-1]
        else:
            X_train = X_train0
            y_train = y_train0
        n_train = y_train.shape[0]

        model = M * [None]
        y_test_pred = M * [None]
        for m in range(M):

            # Train the model and use it to predict y_test
            print('---')
            dtest = xgb.DMatrix(X_test, label=y_test)
            model[m] = train_model(n_train, X_train, y_train, dtest)
            y_test_pred[m] = model[m].predict(dtest)
            print(f'n_train = {n_train}')
            print(f'y_test_pred[m][:{n_disp}] for m = {m}:')
            print(y_test_pred[m][:n_disp])

# ------------------------------------------------------------------------------------------
def train_model(n_train, X_train, y_train, dtest):

    # Permute X_train and y_train
    p = np.random.permutation(n_train)
    X_train = X_train[p]
    y_train = y_train[p]

    # Construct and train the model
    dtrain = xgb.DMatrix(X_train, label=y_train)
    evals = [(dtrain, 'train'), (dtest, 'eval')]
    params = {}
    num_boost_round = 10
    model = xgb.train(params, dtrain, num_boost_round=num_boost_round, evals=evals)

    print(dtrain.get_quantile_cut())

    return model

# ------------------------------------------------------------------------------------------
main_func()

This produces 4 blocks of output. The first line of each block is

[0]     train-rmse:137.27606    eval-rmse:140.60710
[0]     train-rmse:137.27606    eval-rmse:140.60710
[0]     train-rmse:137.23708    eval-rmse:140.54056
[0]     train-rmse:137.12214    eval-rmse:140.43421

Is floating-point error adequate to explain the differences between the third and fourth lines? We can investigate this in a casual manner by making a list of n_train natural numbers, half of them small and half of them very large, and computing the sum of its elements under several different permutations. The differences between the resulting sums suggest the order of magnitude of the floating-point error we might expect from a summation of this length inside XGBoost. Several such sums are computed with the code below.

import numpy as np
from numpy.random import randint, permutation

# Construct x12 and x21
n_half = int(8194/2)
x1 = randint(low=1, high=500, size=(n_half,)).astype(float)
x2 = 1e14 * randint(low=1, high=500, size=(n_half,)).astype(float)
x12 = np.concatenate((x1, x2))
x21 = np.concatenate((x2, x1))

# Construct xp
n = x12.shape[0]
p = permutation(n)
xp = x12[p]

# Construct the floating-point summations from x12, x21, and xp
X12 = (n+1) * [0]
X21 = (n+1) * [0]
Xp  = (n+1) * [0]
for k in range(n):
    X12[k+1] = X12[k] + x12[k]
    X21[k+1] = X21[k] + x21[k]
    Xp[k+1]  = Xp[k]  + xp[k]
print(f'X12[-1] = {X12[-1]}')
print(f'X21[-1] = {X21[-1]}')
print(f'Xp[-1]  = {Xp[-1]}')

Running this code yields

X12[-1] = 1.0280530000000102e+20
X21[-1] = 1.028053e+20
Xp[-1]  = 1.0280530000000003e+20

So I think we can expect the relative floating-point error to be absolutely minute, and for floating-point error to be the source of the discrepancies, we would need to be computing a difference between two quantities that are almost identical (that is, that agree out to about 14 significant digits). We would not be doing so on the first round of boosting, right? If I am missing anything, let me know.
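
For reference, a subtraction of two nearly identical quantities is exactly where such a minute relative error stops being minute; below is a toy illustration of the cancellation effect (not XGBoost's actual gain computation), using the two sums printed above.

# The two sums agree to about 14 significant digits, yet their difference is
# entirely made up of accumulated rounding error.
a = 1.0280530000000102e+20  # X12[-1] from the run above
b = 1.0280530000000000e+20  # X21[-1] from the run above
print(a - b)                # on the order of 1e6 in absolute terms
print((a - b) / a)          # on the order of 1e-14 in relative terms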