Approximation method is slower than the exact method

dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

https://xgboost.readthedocs.io/en/stable/

Apache License 2.0

26.26k stars 8.72k forks source link

Approximation method is slower than the exact method #8883

Open falcon8241 opened 1 year ago

falcon8241 commented 1 year ago

I am comparing the computational performance of approximation method vs. exact method over a few benchmark datasets. It is my impression that we should expect less computation using approximation method as a less number of split points are evaluated in each tree splitting. This is the case for most of the benchmark datasets but one. This dataset is for a binary classification task, with around 2.2M training samples and 434 features (both continuous and encoded categorial features). In this dataset, the approximation method (setting tree_method='approx') is about 20% slower than the exact method. I cannot share the data as it is proprietary, but would like to possible ways to investigate what is the reason here.

trivialfis commented 1 year ago

Could you please share the sparsity of your data? Also, would you like to try the categorical data support in XGBoost? It should be in good shape for approx for 1.7.4

falcon8241 commented 1 year ago

For the sparsity of the data, I made the following histogram on the percentage of null values in each feature. abc

I have not tried categorical data support yet. But I assume it is not the major reason, as I used the same data (same samples/feature values) for a regression task on a different target and the approx is much faster than exact. However, the approx is slow when I use this data for a classification task.

trivialfis commented 1 year ago

thank you for sharing, I think that's the quantile sketching along with the histogram index generation causing the problem. It has specialization for sparse and dense data, but when feature sparsity varies from feature to feature it can be slow.

I mentioned cat data support as it helps performance by eliminating the need for explicit one-hot encoding, which generates a sparse matrix embedded in a dense matrix.

falcon8241 commented 1 year ago

Thank you. Could you elaborate a bit more on "histogram index generation" and its specialization for sparse and dense data? Why it causes problem?

I can work around the problem using hist method instead of approx. But I want to dig deeper to understand more regarding the algorithms in XGBoost.

trivialfis commented 1 year ago

Hi, some brief descriptions of various tree methods can be found at https://xgboost.readthedocs.io/en/stable/treemethod.html .

Basically, the approx tree method needs to run sketching to renew the histogram boundaries for each iteration using the hessian value as weights. For some regression targets like rmse, the hessian is constant (1.0) and the step is skipped. But for binary classification, sketching must be performed. Hence the performance difference you see.

After finding out the histogram boundaries, we need to assign a bin value for each entry in the feature matrix, which we usually call histogram index generation.