Implement equal-frequency binning and other speed up tech from FastBDT?

phunterlau commented 8 years ago

A new paper 'FastBDT: A speed-optimized and cache-friendly implementation of stochastic gradient-boosted decision trees for multivariate classification' http://arxiv.org/abs/1609.06119 introduces some good techniques for speeding up boosting implementation (see section 2.1), and its has good benchmark results comparing with XGBoost on multi-core machines (i7, 4 cores). The open source code is available on https://github.com/thomaskeck/FastBDT do we want to take a look and see if we can get some inspiration for speeding up more on XGBoost?

tqchen commented 8 years ago

It would be great if we can understand better where the difference comes from and break the perf down, there are several factors.

integer vs floating points
- The claim on integer faster than floating point is interesting. I did some comparison study before and did not find too much difference in terms of comparison.
- As for statistics accumulations, it could be possible that integer histograms is faster than floats, however floats are needed for general gradient boosting(customization of objectives).
Pre-binning vs exact greedy
- Single machine version of xgboost uses exact greedy algorithm that searches overall possible candidates, this can be desirable for deeper trees, and is usually what the user want.
- We also have an approximate version that do pre-binning(with distributed approximate quantiles that generalizes to distributed version)
- For binning method, the performance could depend on how the binning is performed, if we want deeper trees on larger dataset, there could be need for more fine-grained bins, or exact greedy method
Row-based subsampling vs column subsampling
- FastBDT uses row based subsampling with ratio=0.5, since it uses row based data structure, there can be a 2x speedup factor
- XGBoost uses column based structure, that is great for column subsampling with speedup (e.g. columsubsample=0.5 and 2x speedup), but not gain speedup with row submsapling.
Missing value handling
- XGBoost uses two sided scanning over missing values, disable that could get a 2x speedup.

So it might be helpful to break it down to do benchmarks on the factors, and see how the two methods are different with respect to tree depth, column/row subsampling and etc.

After we have some of the ideas, we can learn from each other and bring the useful techniques in @phunterlau

phunterlau commented 8 years ago

Thanks @tqchen I just submitted an issue on FastBDT for inviting the author for joining the discussion. I feel excited since this BDT is also originally from a high energy physics experiment Belle II and I would like to have faster and faster BDT for high energy physics, since I used to be a high energy physicist.

thomaskeck commented 8 years ago

@phunterlau Thanks for inviting me to the discussion.

First a general remark. I am aware of the xgboost project and use it myself. FastBDT is a very specialized implementation (I only support classification, no adjustable objective function, no multi-core fitting, no out-of-core functionality, ...) for the needs of the Belle II project. In some algorithms we employ a few hundred multivariate classifiers in an hierarchical structure, others are time critical (high level trigger). Many speed improvements are due to the fact that I am not as general as xgboost.

@tqchen I like to add my opinion on some of your points

The integer vs. floating point: There is another important aspect, the integer binning can directly be used as indices for the histograms. Therefore I avoid using if statements in the hotloops of the algorithm. I hope that in the future the compiler will even be able to autovectorize some of the code (but lets call this claim "ambitious")
Pre-binning vs exact greedy: The equal-frequency binning is also a form of regularization (cut search is effectively performed on the quantiles of the feature distribution) and helps to avoid overfitting. In addition all our measurements (features) in HEP (high energy physics) usually have an uncertainty, hence a binning finer than the resolution will be worse. Finally the binning does not need to be done at application time, because the trees are only interested in the order-statistic and not the values itself. In consequence one can convert the bin-number to an equivalent floating point cut. I found this type of binning useful in many machine learning algorithms (e.g. neural networks, where it helps to deal with outliers).

I'm not familiar with your source-code, so here are some wild guesses: I guess the equal-frequency binning could be interesting for your project (maybe you already have an implementation for this, I don't know). I think it will be difficult to incorporate the FastBDT fitting optimizations into XGBoost without loosing your support for arbitrary problems (regression, classification, with/without binning). On the other hand in principle you should be able to be as fast during the application.

I benchmarked all implementations using C++ and not their python interfaces. The code I used for the benchmarking is available here: https://github.com/thomaskeck/FastBDT/blob/master/examples/performance.cxx

Best regards, Thomas

bwillers commented 8 years ago

@thomaskeck @tqchen I've actually done something similar (using integer binned inputs, which become indexes for histogram access) for regression problems with a variety of loss functions, by keeping track of the sum of G and H, and then doing the gain calculations in terms of these aggreagted arrays. Roughly something like:

for i in range(nrows):
    j = x[i]
    C[j] += 1
    G[j] += g[i]
    H[j] += h[i]

Unfortunately I'm not in a position to share the full code, but I'm sure you can figure it out. It's obviously not as fast as the pure count histogram use case, but it's still noticably faster than doing it over floating point inputs.

I've tried using both a column and row continuous layout, row based makes for better cache efficiency, but column based makes it trivial to parallelize over many cores, just run that loop above for a different feature on each core. I've thought about parallelization over rows instead (with a row continuous layout), but haven't gotten around to it.

RAMitchell commented 8 years ago

In my GPU algorithm I avoided scanning twice to deal with missing values.

I first did a sum (reduction) over all non-missing values and calculated the sum of missing values as the sum from root node - sum of non-missing values.

A single scan can then be performed testing the gain with missing-left and missing-right at the same time.

This helps if the cost of a reduction is significantly less than the cost of a scan.

tqchen commented 7 years ago

supported in https://github.com/dmlc/xgboost/issues/1950

dmlc / xgboost

Implement equal-frequency binning and other speed up tech from FastBDT? #1604