I've been looking into XGBoost and I'm trying to understand what it adds over sklearn's implementation of GradientBoostingClassifier. Do you know?
It's mostly just a faster version of GradientBoostingClassifier: http://auduno.com/post/96084011658/some-nice-ml-libraries
However, it's also mentioned in a number of Kaggle winning solutions, since gradient boosting apparently does quite well in those competitions:
1. https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
2. http://blog.kaggle.com/2015/12/21/rossmann-store-sales-winners-interview-1st-place-gert/
3. http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/
(Just found out it has its own tag on the Kaggle blog: http://blog.kaggle.com/tag/xgboost/)
Besides being highly optimized, as tcfuji mentioned, I understand it can also be trained in a distributed fashion. Would it easily interface with pandas, though?
@tcfuji: If it works better than sklearn's GradientBoostingClassifier, isn't incredibly slow (in comparison), and the XGBoost library isn't a pain to integrate with, then I'm not opposed to integrating XGBoost into TPOT. Are you free to do a small benchmark on, say, MNIST or CIFAR-10? I'd be interested to see performance in terms of accuracy and training time.
@bartleyn: From my reading, the Python implementation of XGBoost exposes the same fit/predict interface as the other sklearn classifiers. I don't think that would be a difficulty.
@bartleyn As Randy mentioned, the xgboost Python API makes it easy, since it can construct its main data structure (DMatrix) directly from numpy arrays. Also, TPOT's _train_model_and_predict method converts the pandas DataFrame inputs into numpy arrays (via the .values attribute).
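A minimal sketch of that interop (the DataFrame and column names here are just made up as a stand-in for TPOT's input):
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier

# Made-up DataFrame standing in for TPOT's pandas input.
df = pd.DataFrame({'feat_1': np.random.rand(100),
                   'feat_2': np.random.rand(100),
                   'class': np.random.randint(0, 2, 100)})

# .values yields plain numpy arrays, as _train_model_and_predict does.
X = df.drop('class', axis=1).values
y = df['class'].values

# The sklearn-style wrapper accepts numpy arrays directly...
clf = XGBClassifier().fit(X, y)
predictions = clf.predict(X)

# ...and the lower-level API builds its DMatrix from the same arrays.
dtrain = xgb.DMatrix(X, label=y)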
@rhiever Sure, I'll work on it over the weekend. Want something similar to tutorials/IRIS.ipynb and tutorials/MNIST.ipynb, while keeping track of the training time?
That sounds good to me, @tcfuji. Thank you!
@rhiever As we discussed yesterday, you wanted me to evaluate the performance of xgboost itself, not my fork.
The results were better than I expected:
from sklearn.datasets import load_digits, make_classification
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from time import perf_counter
import numpy as np
gb = GradientBoostingClassifier()
xgb = XGBClassifier()
MNIST (sklearn's 8x8 load_digits version):
digits = load_digits()
X_train_digit, X_test_digit, y_train_digit, y_test_digit = train_test_split(digits.data, digits.target,
train_size=0.75, test_size=0.25)
start = perf_counter()
gb.fit(X_train_digit, y_train_digit)
print(gb.score(X_test_digit, y_test_digit))
print("%f seconds" % (perf_counter() - start))
0.957777777778
6.918697 seconds
start = perf_counter()
xgb.fit(X_train_digit, y_train_digit)
print(np.mean(xgb.predict(X_test_digit) == y_test_digit))
print("%f seconds" % (perf_counter() - start))
0.955555555556
1.720479 seconds
Using the make_classification function:
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_classes=10)
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y,
train_size=0.7, test_size=0.3)
start = perf_counter()
gb.fit(X_train_mc, y_train_mc)
print(gb.score(X_test_mc, y_test_mc))
print("%f seconds" % (perf_counter() - start))
0.494
763.447380 seconds
start = perf_counter()
xgb.fit(X_train_mc, y_train_mc)
print(np.mean(xgb.predict(X_test_mc) == y_test_mc))
print("%f seconds" % (perf_counter() - start))
0.513
52.425525 seconds
With a few other variations of make_classification (changing the parameters), xgboost consistently performed about 14x faster than scikit GB.
One caveat is that this speed increase is largely due to OpenMP. For people on Linux, and on OS X (after running brew install gcc --without-multilib), this shouldn't be a problem, but it's still another dependency.
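To get a rough sense of how much of the speedup comes from OpenMP without rebuilding, you can also pin the wrapper to a single thread and rerun the make_classification benchmark (a quick sketch reusing the split from above; nthread is the thread-count parameter exposed by the sklearn wrapper):
# Single-threaded run, roughly approximating the no-OpenMP case.
xgb_single = XGBClassifier(nthread=1)
start = perf_counter()
xgb_single.fit(X_train_mc, y_train_mc)
print(np.mean(xgb_single.predict(X_test_mc) == y_test_mc))
print("%f seconds" % (perf_counter() - start))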
Thank you for running these benchmarks, @tcfuji! Is that a hard dependency on OpenMP, or is it optional? I'm concerned that making OpenMP a requirement for TPOT would cut down on its potential user base pretty significantly.
According to the xgboost docs (https://xgboost.readthedocs.org/en/latest/build.html), it does not appear to be a hard dependency.
How do the benchmarks look without OpenMP?
Just ran the same code without OpenMP. As expected, not as fast but still consistently faster (about 2x to 4x) than scikit's GradientBoostingClassifier.
If we can make OpenMP an optional dependency, I think xgboost would be a great addition.
Looks good to me. Just tried running the benchmarks myself and XGBoost looks like a solid improvement over the GradientBoostingClassifier. Easy to pip install too. Go ahead and put together a PR to replace the GradientBoostingClassifier with XGBoost.
Thanks for looking into this, @tcfuji.
:thumbsup:
Hi Randy,
Thanks to XGBoost's scikit-learn API, it was not difficult to replace scikit-learn's GradientBoostingClassifier with xgboost. I created a separate branch here: https://github.com/tcfuji/tpot/tree/xgboost
I tested it a little and it seems to be working. Here's an example of an exported pipeline:
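Roughly along these lines (a simplified sketch rather than the exact export; the file path, separator, and hyperparameters below are placeholders, and the pipeline steps TPOT picks will vary):
import pandas as pd
from sklearn.cross_validation import train_test_split
from xgboost import XGBClassifier

# 'PATH/TO/DATA/FILE' and 'COLUMN_SEPARATOR' are placeholders for the
# actual dataset path and separator.
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(tpot_data.drop('class', axis=1).values,
                     tpot_data['class'].values, random_state=42)

# Illustrative hyperparameters, not the ones TPOT actually chose.
xgbc = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=3)
xgbc.fit(training_features, training_classes)
results = xgbc.predict(testing_features)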
Would this be a desirable addition to the master branch? Of course, this would require another dependency (XGBoost itself!).