ldesreumaux opened this issue 2 months ago
I agree that the gap is unexpectedly large. The choice of QDM (QuantileDMatrix) is for reduced memory usage, as it compresses the data, but there is a cost in data lookup during prediction. I will see what can be done there: perhaps use in-place prediction, or optimize the value lookup a bit more.
I have observed a significant performance regression in XGBoost version 1.7 when using the fit method with evaluation sets in sklearn estimators. The issue appears to have been introduced by this commit, which defaults to using QuantileDMatrix for both training and evaluation sets.
While the optimization of prediction with QuantileDMatrix has been addressed in https://github.com/dmlc/xgboost/issues/9013, there remains a significant performance gap when using QuantileDMatrix for evaluation sets compared to DMatrix.
Here is sample code to reproduce the issue:
Performance comparison (with current master branch):
Here are profiling graphs for the two cases:
The graphs clearly show that the performance degradation is linked to the prediction step with QuantileDMatrix for evaluation sets.
This sample code uses synthetic data, but I have observed a performance degradation of the same order of magnitude with a real-world dataset.
If no further optimization is possible, I would suggest changing the default behavior to use a plain DMatrix for the evaluation sets.