dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Implement equal-frequency binning and other speed up tech from FastBDT? #1604

Closed: phunterlau closed this issue 7 years ago

phunterlau commented 8 years ago

A new paper, 'FastBDT: A speed-optimized and cache-friendly implementation of stochastic gradient-boosted decision trees for multivariate classification' (http://arxiv.org/abs/1609.06119), introduces some good techniques for speeding up a boosting implementation (see section 2.1), and it has good benchmark results compared with XGBoost on multi-core machines (i7, 4 cores). The open source code is available at https://github.com/thomaskeck/FastBDT. Do we want to take a look and see if we can get some inspiration for speeding up XGBoost further?

tqchen commented 8 years ago

It would be great if we could understand better where the difference comes from and break the performance down; there are several factors.

So it might be helpful to benchmark those factors separately, and see how the two methods differ with respect to tree depth, column/row subsampling, etc.

Once we have some of those ideas, we can learn from each other and bring the useful techniques in. @phunterlau

phunterlau commented 8 years ago

Thanks @tqchen. I just submitted an issue on FastBDT inviting the author to join the discussion. I'm excited because this BDT also originated in a high energy physics experiment, Belle II, and since I used to be a high energy physicist I would love ever-faster BDTs for the field.

thomaskeck commented 8 years ago

@phunterlau Thanks for inviting me to the discussion.

First, a general remark. I am aware of the xgboost project and use it myself. FastBDT is a very specialized implementation (I only support classification, no adjustable objective function, no multi-core fitting, no out-of-core functionality, ...) built for the needs of the Belle II project. In some algorithms we employ a few hundred multivariate classifiers in a hierarchical structure; others are time-critical (high-level trigger). Many of the speed improvements are due to the fact that I am not as general as xgboost.

@tqchen I'd like to add my opinion on some of your points.

I'm not familiar with your source code, so here are some wild guesses: I guess the equal-frequency binning could be interesting for your project (maybe you already have an implementation of this, I don't know). I think it will be difficult to incorporate the FastBDT fitting optimizations into XGBoost without losing your support for arbitrary problems (regression, classification, with/without binning). On the other hand, in principle you should be able to be just as fast during application.
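For concreteness, equal-frequency binning can be sketched in a few lines of NumPy (my own illustration, not code from FastBDT or XGBoost; the names fit_bin_edges and to_bins are made up for the example): the bin edges are interior quantiles of the training values, so every bin receives roughly the same number of samples.

import numpy as np

def fit_bin_edges(values, n_bins=16):
    # Equal-frequency binning: place edges at the interior quantiles,
    # so each of the n_bins bins holds roughly the same sample count.
    inner_quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, inner_quantiles)

def to_bins(values, edges):
    # Map each raw feature value to an integer bin index in [0, n_bins).
    return np.searchsorted(edges, values, side="right")

After this preprocessing, fitting only ever touches small integer indices, which is what makes the histogram loops further down in this thread cache-friendly.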

I benchmarked all implementations using C++ and not their python interfaces. The code I used for the benchmarking is available here: https://github.com/thomaskeck/FastBDT/blob/master/examples/performance.cxx

Best regards, Thomas

bwillers commented 8 years ago

@thomaskeck @tqchen I've actually done something similar (using integer-binned inputs, which become indices for histogram access) for regression problems with a variety of loss functions, by keeping track of the sums of G and H and then doing the gain calculations in terms of these aggregated arrays. Roughly something like:

# x: integer bin index per row; g, h: per-row gradient and hessian
C = [0.0] * nbins   # sample count per bin
G = [0.0] * nbins   # gradient sum per bin
H = [0.0] * nbins   # hessian sum per bin
for i in range(nrows):
    j = x[i]
    C[j] += 1
    G[j] += g[i]
    H[j] += h[i]

Unfortunately I'm not in a position to share the full code, but I'm sure you can figure it out. It's obviously not as fast as the pure count-histogram use case, but it's still noticeably faster than doing it over floating-point inputs.
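To make the gain calculation over the aggregated arrays concrete, here is a hedged sketch (my own illustration, not bwillers' actual code), using the usual second-order split gain with an L2 regularization constant lam:

import numpy as np

# G, H: per-bin gradient/hessian sums from the loop above.
# (Constant factor 1/2 and the complexity penalty are omitted,
# since they don't change the argmax.)
GL, HL = np.cumsum(G), np.cumsum(H)   # sums left of each bin boundary
GT, HT = GL[-1], HL[-1]               # totals for the node
gain = (GL**2 / (HL + lam)
        + (GT - GL)**2 / (HT - HL + lam)
        - GT**2 / (HT + lam))
best = int(np.argmax(gain[:-1]))      # last boundary sends all rows left

Every split candidate for the feature is scored from one cumulative-sum pass over nbins entries instead of a pass over nrows values.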

I've tried both a column-contiguous and a row-contiguous layout; row-based makes for better cache efficiency, but column-based makes it trivial to parallelize over many cores: just run the loop above for a different feature on each core (see the sketch below). I've thought about parallelizing over rows instead (with a row-contiguous layout), but haven't gotten around to it.
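A minimal sketch of the column-parallel variant (my own illustration; X_binned, nbins, and nfeatures are assumed names, not from the thread): each per-feature task reads only its own column and writes only its own histograms, so no synchronization is needed. Whether Python threads actually run this in parallel depends on NumPy releasing the GIL inside bincount; the structure is the point.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def hist_for_feature(col, g, h, nbins):
    # Count / gradient / hessian histograms for one binned feature.
    return (np.bincount(col, minlength=nbins),
            np.bincount(col, weights=g, minlength=nbins),
            np.bincount(col, weights=h, minlength=nbins))

# X_binned: (nrows, nfeatures) integer matrix stored column-contiguous,
# so each task streams sequentially through its own column.
with ThreadPoolExecutor() as pool:
    hists = list(pool.map(
        lambda f: hist_for_feature(X_binned[:, f], g, h, nbins),
        range(nfeatures)))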

RAMitchell commented 8 years ago

In my GPU algorithm I avoided scanning twice to deal with missing values.

I first did a sum (reduction) over all non-missing values and calculated the sum of the missing values as the node's total sum minus the sum of the non-missing values.

A single scan can then be performed, testing the gain with missing-left and missing-right at the same time.

This helps if the cost of a reduction is significantly less than the cost of a scan.
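In NumPy terms the idea looks roughly like this (my own CPU-side sketch, not the actual GPU kernel; all names are invented for the example). G_bins and H_bins hold per-bin sums over the non-missing rows only, and G_node and H_node are the node totals including missing rows:

import numpy as np

G_miss = G_node - G_bins.sum()   # one cheap reduction, no second scan
H_miss = H_node - H_bins.sum()

GL, HL = np.cumsum(G_bins), np.cumsum(H_bins)

def gain(gl, hl):
    # Second-order split gain with L2 regularization lam.
    gr, hr = G_node - gl, H_node - hl
    return (gl**2 / (hl + lam) + gr**2 / (hr + lam)
            - G_node**2 / (H_node + lam))

gain_missing_left = gain(GL + G_miss, HL + H_miss)   # missing rows go left
gain_missing_right = gain(GL, HL)                    # missing rows go right

Both default directions are evaluated from the same cumulative sums, so only one scan over the bins is needed.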

tqchen commented 7 years ago

Supported in https://github.com/dmlc/xgboost/issues/1950