dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Performance regression in fit method with evaluation sets #10793

Open ldesreumaux opened 2 months ago

ldesreumaux commented 2 months ago

I have observed a significant performance regression in XGBoost version 1.7 when using the fit method with evaluation sets in sklearn estimators. The issue appears to have been introduced by this commit, which defaults to using QuantileDMatrix for both training and evaluation sets.

While the optimization of prediction with QuantileDMatrix has been addressed in https://github.com/dmlc/xgboost/issues/9013, there remains a significant performance gap when using QuantileDMatrix for evaluation sets compared to DMatrix.

Here is a sample code to reproduce the issue:

import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import time

n_samples = 1000000
n_features = 100
seed = 42

np.random.seed(seed)

X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, 2, size=n_samples)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=seed)
X_eval1, X_eval2, y_eval1, y_eval2 = train_test_split(X_temp, y_temp, test_size=0.5, random_state=seed)

model = XGBClassifier(
    tree_method='hist',
    max_depth=6,
    n_estimators=500,
    eval_metric='logloss',
    random_state=seed
)

start_time = time.time()

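# With the 'hist' tree method, the sklearn wrapper builds a QuantileDMatrix for
# each eval_set entry, and every boosting round runs prediction on them to
# compute the logloss metric.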
model.fit(X_train, y_train, eval_set=[(X_eval1, y_eval1), (X_eval2, y_eval2)], verbose=True)

end_time = time.time()
execution_time = end_time - start_time

y_pred_eval1 = model.predict(X_eval1)
y_pred_eval2 = model.predict(X_eval2)

accuracy_eval1 = accuracy_score(y_eval1, y_pred_eval1)
accuracy_eval2 = accuracy_score(y_eval2, y_pred_eval2)

print(f"Accuracy on Evaluation Set 1: {accuracy_eval1:.4f}")
print(f"Accuracy on Evaluation Set 2: {accuracy_eval2:.4f}")

print(f"Execution Time: {execution_time:.2f} seconds")

Performance comparison (with current master branch):

Here are profiling graphs for the two cases:

The graphs clearly show that the performance degradation is linked to the prediction step with QuantileDMatrix for evaluation sets.

This sample code uses synthetic data, but I have observed the same order of magnitude of performance degradation with a real-world dataset.
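To isolate the prediction cost from the rest of training, a small standalone benchmark along these lines (assuming Booster.predict accepts a QuantileDMatrix, which https://github.com/dmlc/xgboost/issues/9013 indicates is supported) can time predict on the same data wrapped in each container type:

import time

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X_tr = rng.random((500_000, 100))
y_tr = rng.integers(0, 2, size=X_tr.shape[0])
X_ev = rng.random((200_000, 100))

# Train once on a QuantileDMatrix so the eval-side QuantileDMatrix below can
# reference its quantile cuts, mirroring what the sklearn wrapper does internally.
dtrain = xgb.QuantileDMatrix(X_tr, label=y_tr)
booster = xgb.train(
    {"tree_method": "hist", "objective": "binary:logistic", "max_depth": 6},
    dtrain,
    num_boost_round=100,
)

# Time prediction on the same evaluation data held in the two container types.
for name, dmat in [
    ("DMatrix", xgb.DMatrix(X_ev)),
    ("QuantileDMatrix", xgb.QuantileDMatrix(X_ev, ref=dtrain)),
]:
    start = time.time()
    booster.predict(dmat)
    print(f"predict on {name}: {time.time() - start:.2f} seconds")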

If no further optimization is possible, I would suggest changing the default behavior to use a plain DMatrix for the evaluation sets.
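In the meantime, a possible workaround (a sketch, assuming one is willing to drop to the native xgboost.train API) is to keep only the training data as a QuantileDMatrix and build the evaluation sets as plain DMatrix objects; this reuses X_train, y_train, X_eval1, etc. from the reproduction script above:

import xgboost as xgb

# Training data compressed as QuantileDMatrix (as the sklearn wrapper does);
# evaluation sets as plain DMatrix to avoid the slow lookup during prediction.
dtrain = xgb.QuantileDMatrix(X_train, label=y_train)
deval1 = xgb.DMatrix(X_eval1, label=y_eval1)
deval2 = xgb.DMatrix(X_eval2, label=y_eval2)

params = {
    "tree_method": "hist",
    "max_depth": 6,
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "seed": seed,
}

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(deval1, "eval1"), (deval2, "eval2")],
    verbose_eval=True,
)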

trivialfis commented 2 months ago

I agree that the gap is unexpectedly large. QuantileDMatrix (QDM) is chosen for reduced memory usage, as it compresses the data, but there is a cost in data lookup during prediction. I will see what can be done there: maybe use in-place predict, maybe optimize the value lookup a bit more.
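For reference, the existing user-facing in-place prediction path already skips DMatrix/QuantileDMatrix construction entirely; whether the internal evaluation loop can take a similar route is a separate question. A rough sketch of that existing API (not the internal change being considered here):

import numpy as np
import xgboost as xgb

X = np.random.rand(100_000, 100)
y = np.random.randint(0, 2, size=X.shape[0])

booster = xgb.train(
    {"tree_method": "hist", "objective": "binary:logistic"},
    xgb.DMatrix(X, label=y),
    num_boost_round=10,
)

# Regular path: wrap the array in a DMatrix before predicting.
preds_dmatrix = booster.predict(xgb.DMatrix(X))

# In-place path: predict straight from the NumPy array; no DMatrix is built.
preds_inplace = booster.inplace_predict(X)

print(np.allclose(preds_dmatrix, preds_inplace, atol=1e-6))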