EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Gradient Boosting with XGBoost #81

Closed: tcfuji closed this issue 8 years ago

tcfuji commented 8 years ago

Hi Randy,

Thanks to XGBoost's scikit-learn API, it was not difficult to replace the scikit-learn GradientBoostingClassifier with xgboost. I created a separate branch, available here: https://github.com/tcfuji/tpot/tree/xgboost

I tested it a little and it seems to be working. Here's an example of an exported pipeline:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from xgboost import XGBClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))

result1 = tpot_data.copy()

# Perform classification with an eXtreme gradient boosting classifier
xgbc1 = XGBClassifier(learning_rate=0.01, n_estimators=42, max_depth=94)
xgbc1.fit(result1.loc[training_indices].drop('class', axis=1).values, result1.loc[training_indices, 'class'].values)
result1['xgbc1-classification'] = xgbc1.predict(result1.drop('class', axis=1).values)

Would this be a desirable addition to the master branch? Of course, this would require another dependency (XGBoost itself!).

rhiever commented 8 years ago

I've been looking into XGBoost and I'm trying to understand what it adds over sklearn's implementation of GradientBoostingClassifier. Do you know?

tcfuji commented 8 years ago

It's mostly just a faster version of GradientBoostingClassifier: http://auduno.com/post/96084011658/some-nice-ml-libraries

However, it's also mentioned in a number of Kaggle winning solutions, because Gradient Boosting apparently does quite well in those competitions:

  1. https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
  2. https://github.com/dmlc/xgboost/tree/master/demo/kaggle-higgs
  3. https://github.com/daxiongshu/kaggle-tradeshift-winning-solution
  4. http://blog.kaggle.com/2015/12/21/rossmann-store-sales-winners-interview-1st-place-gert/
  5. http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/

(I just found out it has its own tag on the Kaggle blog: http://blog.kaggle.com/tag/xgboost/)

bartleyn commented 8 years ago

Besides being highly optimized, as tcfuji mentioned, I understand it can also be trained in a distributed fashion. Would it interface easily with pandas, though?

rhiever commented 8 years ago

@tcfuji: If it works better than sklearn's GradientBoostingClassifier, isn't incredibly slow (in comparison), and the XGBoost library isn't a pain to integrate with, then I'm not opposed to integrating XGBoost into TPOT. Are you free to do a small benchmark on, say, MNIST or CIFAR-10? I'd be interested to see performance in terms of accuracy and training time.

@bartleyn: From my readings, the Python implementation of XGBoost has the exact same interface as all other sklearn classifiers. I don't think that would be a difficulty.

tcfuji commented 8 years ago

@bartleyn As Randy mentioned, the XGBoost Python API makes this easy, since it can construct its main data structure (DMatrix) directly from NumPy arrays. Also, the TPOT method _train_model_and_predict converts the pandas DataFrame inputs into NumPy arrays (via the .values attribute).
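
For reference, a minimal sketch of both entry points (the toy arrays and hyperparameters below are made up for illustration; this is not TPOT code):

import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier

# Toy data standing in for a DataFrame's .values output
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Native API: a DMatrix can be built directly from NumPy arrays
dtrain = xgb.DMatrix(X, label=y)

# scikit-learn API: same fit/predict interface as GradientBoostingClassifier
clf = XGBClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X, y)
predictions = clf.predict(X)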

@rhiever Sure, I'll work on it over the weekend. Do you want something similar to tutorials/IRIS.ipynb and tutorials/MNIST.ipynb, but with the training time tracked as well?

rhiever commented 8 years ago

That sounds good to me, @tcfuji. Thank you!

tcfuji commented 8 years ago

@rhiever As we discussed yesterday, you wanted me to evaluate the performance of xgboost itself, not my fork.

The results were better than I expected:

from sklearn.datasets import load_digits, make_classification
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from time import perf_counter
import numpy as np

gb = GradientBoostingClassifier()
xgb = XGBClassifier()

MNIST:

digits = load_digits()
X_train_digit, X_test_digit, y_train_digit, y_test_digit = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

start = perf_counter()
gb.fit(X_train_digit, y_train_digit)
print(gb.score(X_test_digit, y_test_digit))
print("%f seconds" % (perf_counter() - start))

0.957777777778
6.918697 seconds

start = perf_counter()
xgb.fit(X_train_digit, y_train_digit)
print(np.mean(xgb.predict(X_test_digit) == y_test_digit))
print("%f seconds" % (perf_counter() - start))

0.955555555556
1.720479 seconds


Using the make_classification function:

X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_classes=10)
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y,
                                                    train_size=0.7, test_size=0.3)
start = perf_counter()
gb.fit(X_train_mc, y_train_mc)
print(gb.score(X_test_mc, y_test_mc))
print("%f seconds" % (perf_counter() - start))

0.494
763.447380 seconds

start = perf_counter()
xgb.fit(X_train_mc, y_train_mc)
print(np.mean(xgb.predict(X_test_mc) == y_test_mc))
print("%f seconds" % (perf_counter() - start))

0.513
52.425525 seconds

With a few other variations of make_classification (changing the parameters), xgboost consistently performed about 14x faster than scikit GB.

One caveat is that this speed increase is likely due to OpenMP. For people on Linux, and on OS X (after running brew install gcc --without-multilib), this shouldn't be a problem, but it's still another dependency.
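
For comparison purposes, the OpenMP parallelism can also be capped at runtime rather than removed at build time. A minimal sketch, assuming the nthread parameter of the scikit-learn wrapper (renamed n_jobs in later XGBoost releases):

from xgboost import XGBClassifier

# Limit XGBoost to one thread to approximate single-threaded timings
# (nthread in the 2016-era wrapper; later releases use n_jobs instead)
xgb_single = XGBClassifier(nthread=1)

# Default behavior: let OpenMP use all available cores
xgb_parallel = XGBClassifier(nthread=-1)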

rhiever commented 8 years ago

Thank you for running these benchmarks, @tcfuji! Is that a hard dependency on OpenMP, or is it optional? I'm concerned that making OpenMP a requirement for TPOT would cut down on its potential user base pretty significantly.

tcfuji commented 8 years ago

According to the xgboost docs (https://xgboost.readthedocs.org/en/latest/build.html), it does not appear to be a hard dependency.

rhiever commented 8 years ago

How do the benchmarks look without OpenMP?

tcfuji commented 8 years ago

Just ran the same code without OpenMP. As expected, it was not as fast, but it was still consistently faster (about 2x to 4x) than scikit-learn's GradientBoostingClassifier.

If we can make OpenMP an optional dependency, I think xgboost would be a great addition.

rhiever commented 8 years ago

Looks good to me. Just tried running the benchmarks myself and XGBoost looks like a solid improvement over the GradientBoostingClassifier. Easy to pip install too. Go ahead and put together a PR to replace the GradientBoostingClassifier with XGBoost.

Thanks for looking into this, @tcfuji.

tcfuji commented 8 years ago

#83

rhiever commented 8 years ago

#83 merged.

tcfuji commented 8 years ago

:thumbsup: