Open 6nc0r6-1mp6r0 opened 1 week ago
Can confirm the finding (thanks for the very detailed report!), and made the following additional observations:
Yes, XGBoost is order-dependent when there is more than 1 thread. This is caused by floating point summation not being associative:
a + b + c != a + (b +c)
In a parallel environment, we split the data into chunks for multiple threads to consume:
# single thread
res = a + b + c + d
# 2 threads
thread0 = a + b
thread1 = c + d
res = thread0 + thread1
Even with a single thread, it's still order-dependent. The issue is just whether the floating point error can accumulate to a point where it can affect the tree splits.
Thanks @maxaehle for the good ideas. I can reproduce the first observation, but not the second; when I set n_jobs
= 1, the output predictions for 8195 are still quite different on my machine. I have updated my code above to indicate how I tested each observation. (Hopefully I didn't do anything silly.)
@trivialfis, I appreciate the clarifications. However, I am skeptical that the nonassociativity of floating-point summation can explain the major transition in order dependence that I have described:
n_train
, since this threshold would be based on large sums that are themselves based on random data. Yet the order dependence of the output predictions emerges at n_train
= 8194 every time.X_train
that is vastly larger than floating-point precision in Python. I have inserted a block that does this into the code above, and the output shows that the predictions remain (essentially) order independent for n_train
< 8195 and order dependent otherwise.The threshold value 8194 is quite nearly $2^{13} = 8192$, and the block size given in the XGBoost paper cited above is 8 times this: $2^{16} = 65,536$. Could the issue have to do with a use of blocks? Broadly, I am finding it hard to imagine an implementation of XGBoost that naturally/incidentally satisfies both (i) order dependence if and only if n_train
> 8194 and (ii) maxaehle's first observation. Certainly a puzzle!
The paper is a bit outdated for implementation.
Floating point is a main source of such issues,but there is other source. The quantile sketching works on stream of data and prunes the summary as more input comes in. In such case, prune results can be dependent on the arrival order of the data. You can try to verify this by using Quantile DMatrix with get cut. If quantile is the the issue, then it's very likely just floating point errors, the gain calculation is very sensitive there.
Is the pruning method of the quantile sketching algorithm used only above n_train
> 8194, or does it become significantly more order dependent above this threshold? If either is the case, then this sounds like it could be the explanation.
The verification that you mention using get_quantile_cut()
is below.
import numpy as np
from numpy.random import seed
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import xgboost as xgb
M = 2 # number of models to compare
n_disp = 5 # number of elements of y_test_pred[m] to display
np.set_printoptions(precision=5, linewidth=1000, suppress=True)
seed(42)
# ------------------------------------------------------------------------------------------
def main_func():
# Construct X and y
X, y = make_regression(n_samples=10244)
# Split X and y for training and testing
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2)
n_feat = X_train0[0].shape[0] # number of features
for remove_last_datapt in [1, 0]:
# Remove last data point of training dataset if requested and compute n_train
if remove_last_datapt == 1:
X_train = X_train0[:-1]
y_train = y_train0[:-1]
else:
X_train = X_train0
y_train = y_train0
n_train = y_train.shape[0]
model = M * [None]
y_test_pred = M * [None]
for m in range(M):
# Train the model and use it to predict y_test
print('---')
dtest = xgb.DMatrix(X_test, label=y_test)
model[m] = train_model(n_train, X_train, y_train, dtest)
y_test_pred[m] = model[m].predict(dtest)
print(f'n_train = {n_train}')
print(f'y_test_pred[m][:{n_disp}] for m = {m}:')
print(y_test_pred[m][:n_disp])
# ------------------------------------------------------------------------------------------
def train_model(n_train, X_train, y_train, dtest):
# Permute X_train and y_train
p = np.random.permutation(n_train)
X_train = X_train[p]
y_train = y_train[p]
# Construct and train the model
dtrain = xgb.DMatrix(X_train, label=y_train)
evals = [(dtrain, 'train'), (dtest, 'eval')]
params = {}
num_boost_round = 10
model = xgb.train(params, dtrain, num_boost_round=num_boost_round, evals=evals)
print(dtrain.get_quantile_cut())
return model
# ------------------------------------------------------------------------------------------
main_func()
This produces 4 blocks of output. The first line of each block is
[0] train-rmse:137.27606 eval-rmse:140.60710
[0] train-rmse:137.27606 eval-rmse:140.60710
[0] train-rmse:137.23708 eval-rmse:140.54056
[0] train-rmse:137.12214 eval-rmse:140.43421
Is floating-point error adequate to explain the differences between the third and fourth lines? We can investigate this in a casual manner by making a list of n_train
very large and small natural numbers and computing the sum of its elements over various permutations of these elements. The differences between the values of the sums suggest the order of magnitude of the floating-point error in the XGBoost system. Several such sums are computed with the code below.
import numpy as np
from numpy.random import randint, permutation
# Construct x12 and x21
n_half = int(8194/2)
x1 = randint(low=1, high=500, size=(n_half,)).astype(float)
x2 = 1e14 * randint(low=1, high=500, size=(n_half,)).astype(float)
x12 = np.concatenate((x1, x2))
x21 = np.concatenate((x2, x1))
# Construct xp
n = x12.shape[0]
p = permutation(n)
xp = x12[p]
# Construct the floating-point summations from x12, x21, and xp
X12 = (n+1) * [0]
X21 = (n+1) * [0]
Xp = (n+1) * [0]
for k in range(n):
X12[k+1] = X12[k] + x12[k]
X21[k+1] = X21[k] + x21[k]
Xp[k+1] = Xp[k] + xp[k]
print(f'X12[-1] = {X12[-1]}')
print(f'X21[-1] = {X21[-1]}')
print(f'Xp[-1] = {Xp[-1]}')
Running this code yields
X12[-1] = 1.0280530000000102e+20
X21[-1] = 1.028053e+20
Xp[-1] = 1.0280530000000003e+20
So I think we can expect floating-point errors to be absolutely minute, and for floating-point errors to be the source of the discrepancies, we would need to be computing a difference between two quantities that are almost identical (that is, identical out to about 14 decimal places). We would not be doing so on the first round of boosting, right? If I am missing anything let me know.
When I use
XGBRegressor
to construct a boosted tree model from 8194 or fewer data points (i.e.,n_train
$\leq$ 8194, wheren_train
is defined in the code below) and randomly shuffle the data points before training, thefit
method is order independent, meaning that it generates the same predictive model each time that it is called. However, when I do the same for 8195 data points,fit
is order dependent -- it generates a different predictive model for each call. Why is this?I have read this paper on XGBoost and nearly all of the XGBoost documentation, and the non-subsampling algorithms described in both appear to be order independent for all
n_train
. So the source of the order dependence for large-n_train
datasets is the mysterious part.Below is a minimal Python script that illustrates the issue.
One run of this code yields
Note that for
n_train
= 8194,y_test_pred[m][:n_disp]
is the same for allm
, but forn_train
= 8195 it is not.Within the script, observe that I permute the elements of
X_train
andy_train
before each run. I would expect this to have no effect on the model produced by the fitting algorithm given that, to my understanding, the feature values are sorted and binned near the start of the algorithm. However, if I comment out this permutation, the high-n_train
order dependence of the algorithm disappears. Also note that within theXGBRegressor
call,tree_method
can be set to'approx'
,'hist'
, or'auto'
andrandom_state
can be set to a fixed value without eliminating the order dependence at largen_train
.Finally, there are several comments in the XGBoost documentation that might initially seem relevant:
For various reasons, however, I suspect that these notes are either unrelated to or inadequate to explain the abrupt transition to order dependence that I have just described.
Any clarity would be appreciated!