dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Changing the order of rows in a toy dataset yields dramatically different predictions for XGBRanker #10025

Open joshHug opened 7 months ago

joshHug commented 7 months ago

Consider the simple script below:

import pandas as pd
import numpy as np
import xgboost as xgb

def fit_and_print_ranker_dump_and_predictions(X, y, qids):
    model = xgb.XGBRanker(objective='rank:ndcg', seed=0, random_state=0)
    model.fit(X, y, qid=qids)
    print(model.get_booster().get_dump(dump_format='text')[0])
    print(pd.Series(model.predict(X), index=X.index))

X = pd.DataFrame([
    [7.5, 3.5],
    [1.5, 4.0],
    [7.0, 2.0],
    [3.0, 8.0],
    [4.5, 3.5],
    [8.0, 8.0],
], columns=['feature1', 'feature2'])

y = pd.Series(
    [0, 
     0, 
     1, # relevant item in group 0
     1, # relevant item in group 1
     0, 
     0])

qids = [0, 0, 0, 1, 1, 1]

print("XGBRanker on original dataset yields the tree shown below, followed by the predictions shown below:")
fit_and_print_ranker_dump_and_predictions(X, y, qids)

# now permute the rows
permutation = [2, 1, 0, 3, 4, 5]
X_permuted = X.iloc[permutation, :]
y_permuted = y[permutation]
print("XGBRanker on permuted dataset yields the tree shown below, followed by the predictions shown below:")
fit_and_print_ranker_dump_and_predictions(X_permuted, y_permuted, qids)

On my machine (an M1 MacBook), I get:

XGBRanker on original dataset yields the tree shown below, followed by the predictions shown below:
0:[feature1<5.75] yes=1,no=2,missing=1
    1:leaf=0.026957728
    2:leaf=-0.0239090417

0    0.356739
1    0.643589
2    0.356739
3    0.643589
4    0.582729
5    0.356739
dtype: float32
XGBRanker on permuted dataset yields the tree shown below, followed by the predictions shown below:
0:[feature1<5.75] yes=1,no=2,missing=1
    1:leaf=-0.0151539575
    2:leaf=0.0168569554

2    0.796365
1   -0.174696
0    0.796365
3    0.264934
4    0.375095
5    0.686204
dtype: float32

Note that the rankers completely disagree. In the first dataset, the highest ranked item in the first group is item 1 (score of 0.64), and in the permuted dataset, item 1 is the lowest ranked item (score of -0.17).
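The per-group winners quoted above can be read off the prediction series with a small pandas `groupby`. A self-contained sketch using the scores printed above (the hardcoded numbers are copied from the two outputs; `idxmax` breaks the tie in the permuted group 0 by taking the first occurrence):

```python
import pandas as pd

# Scores from the two runs above, indexed by original row id
original = pd.Series([0.356739, 0.643589, 0.356739, 0.643589, 0.582729, 0.356739])
permuted = pd.Series([0.796365, -0.174696, 0.796365, 0.264934, 0.375095, 0.686204],
                     index=[2, 1, 0, 3, 4, 5]).sort_index()
qids = pd.Series([0, 0, 0, 1, 1, 1])

# Highest-scoring row per query group, for each run
print(original.groupby(qids).idxmax())  # group 0 -> row 1, group 1 -> row 3
print(permuted.groupby(qids).idxmax())  # group 0 -> row 0, group 1 -> row 5
```

The two models pick a different winner in both groups, even though they were trained on the same six rows.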

On a real-world dataset with hundreds of thousands of rows, we also found that permuting a single pair of rows can lead to massive differences in the resulting rankers and predictions. Likewise, swapping two feature values (out of more than 100 features) in just a single row, or training the same model on the same data on two different machines, produced similarly large differences. Curiously, the degree of difference is approximately the same in all three cases: if we build baseline_model, model_with_two_swapped_rows, model_with_two_swapped_values_in_a_single_row, and model_trained_on_other_machine, every pairwise comparison of two models shows nearly the same impact on the recommendations (measured as whether the top recommendation, the top 3, or the top 10 are the same).
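The "same top rec / top 3 / top 10" comparison can be made precise with a small helper. This is a hypothetical sketch (the function name and metric are ours, not part of XGBoost): for each query group it checks whether two score vectors select the same top-k set of items, and returns the fraction of groups that agree.

```python
import numpy as np

def topk_agreement(scores_a, scores_b, qids, k):
    """Fraction of query groups whose top-k item sets match between
    two score vectors. Hypothetical helper, for illustration only."""
    scores_a, scores_b, qids = map(np.asarray, (scores_a, scores_b, qids))
    matches = []
    for q in np.unique(qids):
        mask = qids == q
        # Indices (within the group) of the k highest-scoring items
        top_a = set(np.argsort(-scores_a[mask])[:k])
        top_b = set(np.argsort(-scores_b[mask])[:k])
        matches.append(top_a == top_b)
    return float(np.mean(matches))

# Toy check: identical scores agree in every group
s = [0.5, 0.1, 0.9, 0.3]
print(topk_agreement(s, s, [0, 0, 1, 1], k=1))  # 1.0
```

Comparing two trained models is then `topk_agreement(model_a.predict(X), model_b.predict(X), qids, k)` for k in {1, 3, 10}.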

trivialfis commented 7 months ago

Hi, thank you for raising the issue. Could you please share the XGBoost version you are using?

joshHug commented 7 months ago

Apparently I was running 1.7.3. With 2.0.3 the effect of row order on the predictions is not as dramatic (for whatever reason I had to increase to 5 samples per group; otherwise every prediction was zero):

import pandas as pd
import numpy as np
import xgboost as xgb

print(xgb.__version__)

def fit_and_print_ranker_dump_and_predictions(X, y, qids):
    model = xgb.XGBRanker(objective='rank:ndcg', seed=0, random_state=0)
    model.fit(X, y, qid=qids)
    print(model.get_booster().get_dump(dump_format='text')[0])
    print(pd.Series(model.predict(X), index=X.index))

X = pd.DataFrame([
    [7.5, 3.5],
    [1.5, 4.0],
    [7.0, 2.0],
    [4.5, 9.0],
    [3.0, 7.0],
    [3.0, 8.0],
    [4.5, 3.5],
    [8.0, 8.0],
    [8.5, 6.5],
    [3.0, 2.0],    
], columns=['feature1', 'feature2'])

y = pd.Series(
    [1, # relevant item in group 0
     0, 
     0,
     0,
     0,
     1, # relevant item in group 1
     0, 
     0,
     0,
     0])

qids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

print("XGBRanker on original dataset yields the tree shown below, followed by the predictions shown below:")
fit_and_print_ranker_dump_and_predictions(X, y, qids)

# now permute the rows
permutation = [0, 1, 2, 3, 4, 
               6, 5, 7, 8, 9]
X_permuted = X.iloc[permutation, :]
y_permuted = y[permutation]
print("XGBRanker on permuted dataset yields the tree shown below, followed by the predictions shown below:")
fit_and_print_ranker_dump_and_predictions(X_permuted, y_permuted, qids)

This yields for me:

XGBRanker on original dataset yields the tree shown below, followed by the predictions shown below:
0:[feature1<7.5] yes=1,no=2,missing=2
    1:leaf=-0.0385963917
    2:leaf=0.0519683175

0    0.526385
1   -0.750671
2   -1.084007
3   -0.429535
4   -0.833963
5    0.798201
6   -1.086628
7   -0.335336
8   -1.131081
9   -0.044666
dtype: float32
XGBRanker on permuted dataset yields the tree shown below, followed by the predictions shown below:
0:[feature1<7] yes=1,no=2,missing=2
    1:leaf=-0.0550594814
    2:leaf=0.0609122328

0    0.554155
1   -0.778710
2   -1.003170
3   -0.398918
4   -0.786649
6   -1.180543
5    0.803940
7   -0.190640
8   -0.998423
9   -0.214787
dtype: float32

That is, the order of the rows still has an impact in the latest XGBoost, but the impact is much smaller than in 1.7.3.

trivialfis commented 7 months ago

We revised the LTR objectives in 2.0; the default pair-construction strategy is now topk instead of sampling. You can visit the learning-to-rank tutorial in XGBoost's documentation for more information; I have added lots of detail there. Hope that helps.