EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

XGBoost parameter error (colsample_bytree=1) #449

Open tjvananne opened 7 years ago

tjvananne commented 7 years ago

I've seen some traffic on these issues regarding potentially getting rid of xgboost altogether due to dependency troubles, so if that is the case then this isn't relevant.

I am receiving the following error message:

Optimization Progress:   0%|                            | 26/10100 [00:22<2:10:08,  1.29pipeline/s][
08:35:46] c:\dev\libs\xgboost\dmlc-core\include\dmlc\./logging.h:235: [08:35:46] C:\dev\libs\xgboost
\src\tree\updater_colmaker.cc:162: Check failed: (n) > (0U) colsample_bytree=1 is too small that no
feature can be included

I know that the colsample_bytree parameter should be the proportion of features each tree is allowed to randomly sample from in order to build itself. So colsample_bytree=1 should tell each tree to sample from 100% of the columns/features when building itself. (Please correct me if I'm wrong on that!)

From the xgboost docs: colsample_bytree = subsample ratio of columns when constructing each tree.
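
(As a sanity check, here is a minimal standalone sketch, using only toy random data, showing that colsample_bytree=1 is normally valid on its own as long as at least one feature is present:)

import numpy as np
from xgboost import XGBClassifier

# toy data: 100 rows, 5 features, binary labels
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = rng.randint(0, 2, 100)

# colsample_bytree=1 lets every tree sample from 100% of the columns,
# so this fits without error when features exist
clf = XGBClassifier(colsample_bytree=1, n_estimators=10)
clf.fit(X, y)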

This has also been raised previously as an issue on xgboost's GitHub repo, but that issue was closed without any real explanation of what the user was doing wrong.

My guess is that this is an error in the parameters being passed into XGBoost, and not necessarily an xgboost issue.

Context of the issue

My environment:

Process to reproduce the issue

This is my simple script to reproduce the error in my environment with random data. The error doesn't tend to occur when my generations and population_size are low (around 10-15 each). I have experienced this issue with generations/population_size as low as 32 (with the same script below). Hopefully this short script is sufficiently reproducible!

print("importing modules...")
import pandas as pd
import numpy as np
import tpot
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import sklearn
import scipy
import xgboost as xgb
from random import randint

# I wanted the label data to be a bit imbalanced
print("creating fake data...")
np.random.seed(1776)
df = pd.DataFrame(np.random.randn(8000,11), columns=list("ABCDEFGHIJK"))
label = np.array([randint(1,11) for mynumber in range(0, 8000)])
label[label <= 9] = 0
label[label >= 10] = 1
print(label)
df['label'] = label

# extract labels and drop them from the DataFrame
y = df['label'].values
colsToDrop = ['label']
xdf = df.drop(colsToDrop, axis=1)

x_train, x_test, y_train, y_test = train_test_split(xdf, y, train_size=0.7, test_size=0.3, random_state=1776)

# this will error out:
tpot = TPOTClassifier(generations=100, population_size=100, verbosity=2,
                      scoring="balanced_accuracy", cv=5, random_state=1776)
tpot.fit(x_train, y_train)

I couldn't find any prior issues addressing this specific error, but I apologize if I missed one.

weixuanfu commented 7 years ago

Hmm, I did reproduce the warning message in my environment. But it is very weird, because colsample_bytree is not in our operator dictionary, so TPOT would not tune this parameter and should keep it at its default value of 1.
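
(For reference, a hedged sketch of how colsample_bytree could be pinned explicitly through a custom operator configuration, assuming a TPOT version that supports the config_dict argument; the value lists here are illustrative, not TPOT's actual defaults:)

import numpy as np
from tpot import TPOTClassifier

# illustrative custom operator config: expose only these XGBClassifier
# hyperparameters to TPOT's search, with colsample_bytree listed explicitly
custom_config = {
    'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-2, 1e-1, 0.5],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'colsample_bytree': [0.5, 0.8, 1.0],
    }
}

tpot = TPOTClassifier(config_dict=custom_config, generations=5, population_size=20)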

I also checked the first 26 pipelines in the TPOT optimization process. Only the two pipelines below used XGBClassifier. I tested both of them and both worked without the warning message. Very weird.

Pipeline(steps=[('xgbclassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True,
       subsample=0.6500000000000001))])

Pipeline(steps=[('xgbclassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=4,
       min_child_weight=7, missing=None, n_estimators=100, nthread=1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.8))])

tjvananne commented 7 years ago

I should have been a bit more explicit for anyone else following along. In addition to the message in the console, I also receive the Windows message that "python.exe has stopped working", and then the program crashes.

[screenshot: Windows "python.exe has stopped working" crash dialog]

Also, I'm not sure if "pipeline" maps 1-to-1 with "generation" (I haven't dug into much of the source yet), but I did have to set the generations parameter of the TPOTClassifier object sufficiently high to receive the error. So maybe it isn't one of the XGBClassifiers in the first 26 pipelines, but rather one of the pipelines later on?

I might be misunderstanding the relationships between pipelines and generations though.
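
(For what it's worth, the 10100 total in the progress bar above suggests the number of evaluated pipelines is roughly population_size * (generations + 1), assuming offspring_size defaults to population_size:)

generations = 100
population_size = 100

# initial population plus one batch of offspring per generation
total_evaluated = population_size * (generations + 1)
print(total_evaluated)  # 10100, matching the progress bar total above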

In the script above, for example, if I set generations and population_size both to 32, I get this error message, but if I lower those parameters to 30 each, there is no error. That seems to be where the threshold is for reproducing this issue.

Note: (Never mind, I just tested it with 30 as the value for both generations and population_size and got the same error message, though not until the optimization was 23% done.)

The tpot object was able to fit with no errors when generations and population_size were both set to 26.

Going to try and investigate this more tonight.

weixuanfu commented 7 years ago

Thank you for the detailed information on this issue.

I don't think the issue is related to the generation number, since the error message showed up in the initial generation, which contains only randomly generated pipelines. I suspect it might be due to the _pre_test decorator, because it tests each pipeline with a small dataset to make sure it is a valid pipeline. Some invalid pipelines including the XGBClassifier operator might cause this issue in _pre_test. I will run more tests to find the reason.
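
(A rough sketch of the idea behind such a pre-test, not TPOT's actual implementation: fit the candidate pipeline on a tiny random sample and reject it if it raises.)

import numpy as np

def pre_test(pipeline, n_rows=50, n_cols=10):
    """Return True if `pipeline` fits on a small random sample without raising."""
    rng = np.random.RandomState(42)
    X = rng.randn(n_rows, n_cols)
    y = rng.randint(0, 2, n_rows)
    try:
        pipeline.fit(X, y)
    except Exception:
        # any failure marks the pipeline as invalid
        return False
    return True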

tjvananne commented 7 years ago

Absolutely! Thanks for your response!

That reminds me, it's also probably worth mentioning that I ran into a few "Invalid pipeline encountered. Skipping its evaluation." messages when using verbosity=3 in the TPOTClassifier() constructor.

Thank you!

weixuanfu commented 7 years ago

I found the reason for the issue. It is caused by pipeline 32 in generation 0 (see below) when using the demo from this issue. The first step is feature selection, but sadly no feature passed its threshold, so no features are available for the XGBClassifier in the second step.

To solve this issue, I will submit a PR that catches this error message to prevent TPOT from crashing.

# pipeline 32
make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(max_features=0.2), threshold=0.30000000000000004),
    XGBClassifier(learning_rate=0.1, max_depth=1, min_child_weight=13, nthread=1, subsample=0.6000000000000001)
)
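
(One way to see the failure mode directly, as a hedged standalone check: fit the selector step alone on comparable random data and count how many features survive.)

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(1776)
X = rng.randn(8000, 11)
y = rng.randint(0, 2, 8000)

# same selector settings as pipeline 32
selector = SelectFromModel(
    estimator=ExtraTreesClassifier(max_features=0.2, random_state=42),
    threshold=0.30000000000000004,
)
selector.fit(X, y)
print(selector.get_support().sum())  # 0: every feature falls below the threshold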

To reproduce the error message without running TPOT, please try the code below:

print("importing modules...")
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import VotingClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
from random import randint
from copy import copy
from tpot import TPOTClassifier

# I wanted the label data to be a bit imbalanced
print("creating fake data...")
np.random.seed(1776)
df = pd.DataFrame(np.random.randn(8000,11), columns=list("ABCDEFGHIJK"))
label = np.array([randint(1,11) for mynumber in range(0, 8000)])
label[label <= 9] = 0
label[label >= 10] = 1
print(label)
df['label'] = label

# extract labels and drop them from the DataFrame
y = df['label'].values
colsToDrop = ['label']
xdf = df.drop(colsToDrop, axis=1)

x_train, x_test, y_train, y_test = train_test_split(xdf, y, train_size=0.7, test_size=0.3, random_state=1776)

# an alternative test pipeline, kept for reference (would also need
# `from sklearn.tree import DecisionTreeClassifier` if uncommented)
"""test_pipeline = make_pipeline(
    make_union(VotingClassifier([("est", DecisionTreeClassifier(criterion="gini", max_depth=10, min_samples_leaf=8, min_samples_split=13))]), FunctionTransformer(copy)),
    XGBClassifier(learning_rate=0.01, max_depth=2, min_child_weight=9, nthread=1, subsample=0.1)
)"""

test_pipeline = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(max_features=0.2), threshold=0.30000000000000004),
    XGBClassifier(learning_rate=0.1, max_depth=1, min_child_weight=13, nthread=1, subsample=0.6000000000000001)
)

# Fix random state when the operator allows (optional), just to get consistent CV scores in TPOT
tpot = TPOTClassifier()
tpot._set_param_recursive(test_pipeline.steps, 'random_state', 42)

# cv scores
cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5, scoring='accuracy', verbose=0)

weixuanfu commented 7 years ago

from xgboost.core import XGBoostError
import warnings

for i in range(2000):
    try:
        # compute cv scores; suppress warnings so only XGBoostError would surface
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5, scoring='accuracy', verbose=0)
    except XGBoostError:
        print("Wrong")

Somehow, the error message still showed up even though I used the code above to catch XGBoostError. But the program did not crash after running this bad pipeline 2000 times.

weixuanfu commented 7 years ago

From this part of the source code in xgboost, it seems that the error message is printed out via std::ostringstream. I am not sure Python can catch this message.
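
(If the message is written by C++ code directly to the process-level stream, Python's exception and warning machinery never sees it. A hedged sketch of muting it by redirecting the underlying file descriptor, assuming the message goes to fd 2 / stderr:)

import os
import contextlib

@contextlib.contextmanager
def suppress_c_stderr():
    """Temporarily redirect the process-level stderr (fd 2) to /dev/null,
    hiding messages written by C/C++ code that bypass sys.stderr."""
    devnull = os.open(os.devnull, os.O_WRONLY)
    saved = os.dup(2)
    try:
        os.dup2(devnull, 2)
        yield
    finally:
        os.dup2(saved, 2)
        os.close(saved)
        os.close(devnull)

# usage (illustrative):
# with suppress_c_stderr():
#     cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5)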

tjvananne commented 7 years ago

Ah that makes sense, good catch!

Would it be acceptable to just suppress any pipelines that don't meet certain conditions (in this case, passing no features along because none met the feature-selection threshold) so they don't get scored / crossed over with other pipelines?

I see what you're saying though; it would probably be best to use XGBoost's built-in error checking from a maintainability perspective, right?

weixuanfu commented 7 years ago

Thank you for these good ideas.

It is hard to tell in advance whether a feature-selection step will remove all features before running the pipeline, since it also depends on the data. We will refine the selector parameters (#423) to prevent this issue.
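
(Until then, a hedged user-side sketch; SafeSelectFromModel is a hypothetical wrapper, not a TPOT or scikit-learn class, that keeps all features whenever the threshold would drop every one of them:)

from sklearn.feature_selection import SelectFromModel

class SafeSelectFromModel(SelectFromModel):
    """Hypothetical wrapper: if the threshold would drop every feature,
    pass all features through instead."""

    def _get_support_mask(self):
        mask = super()._get_support_mask()
        if not mask.any():
            # nothing survived the threshold; keep everything
            return ~mask
        return mask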

In my code above, I tried catching XGBoost's built-in XGBoostError from its Python wrapper, but the error message was still printed even though the program kept running. I think std::ostringstream in the xgboost C++ source prints the error to stdout directly. It is very strange.

bicepjai commented 7 years ago

Do we know why this issue occurs? It would be helpful to know why "colsample_bytree=1 is too small that no feature can be included" happens.

weixuanfu commented 7 years ago

The reason is that a feature-selection step in a pipeline can exclude all features before xgboost runs. We need better control over the number of features within a pipeline.

privefl commented 7 years ago

I have the same issue.

> devtools::session_info()
Session info --------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.0 (2017-04-21)
 system   x86_64, mingw32             
 ui       RStudio (1.0.153)           
 language (EN)                        
 collate  French_France.1252          
 tz       Europe/Paris                
 date     2017-08-24                  

Packages ------------------------------------------------------------------------------------------
 package       * version    date       source                             
 acepack         1.4.1      2016-10-29 CRAN (R 3.4.1)                     
 backports       1.1.0      2017-05-22 CRAN (R 3.4.0)                     
 base          * 3.4.0      2017-04-21 local                              
 base64enc       0.1-3      2015-07-28 CRAN (R 3.4.0)                     
 bigmemory       4.5.19     2016-03-28 CRAN (R 3.4.1)                     
 bigmemory.sri   0.1.3      2014-08-18 CRAN (R 3.4.0)                     
 bigsnpr       * 0.1.0.9001 2017-08-24 local                              
 bigstatsr     * 0.1.0.9002 2017-08-24 local                              
 checkmate       1.8.3      2017-07-03 CRAN (R 3.4.1)                     
 cluster         2.0.6      2017-03-10 CRAN (R 3.4.0)                     
 codetools       0.2-15     2016-10-05 CRAN (R 3.4.0)                     
 colorspace      1.3-2      2016-12-14 CRAN (R 3.4.0)                     
 compiler        3.4.0      2017-04-21 local                              
 crayon          1.3.2.9000 2017-07-22 Github (gaborcsardi/crayon@750190f)
 data.table      1.10.4     2017-02-01 CRAN (R 3.4.0)                     
 datasets      * 3.4.0      2017-04-21 local                              
 devtools        1.13.3     2017-08-02 CRAN (R 3.4.1)                     
 digest          0.6.12     2017-01-27 CRAN (R 3.4.0)                     
 foreach       * 1.4.3      2015-10-13 CRAN (R 3.4.0)                     
 foreign         0.8-67     2016-09-13 CRAN (R 3.4.0)                     
 Formula       * 1.2-2      2017-07-10 CRAN (R 3.4.1)                     
 ggplot2       * 2.2.1.9000 2017-07-23 Github (hadley/ggplot2@331977e)    
 graphics      * 3.4.0      2017-04-21 local                              
 grDevices     * 3.4.0      2017-04-21 local                              
 grid            3.4.0      2017-04-21 local                              
 gridExtra       2.2.1      2016-02-29 CRAN (R 3.4.0)                     
 gtable          0.2.0      2016-02-26 CRAN (R 3.4.0)                     
 Hmisc         * 4.0-3      2017-05-02 CRAN (R 3.4.1)                     
 htmlTable       1.9        2017-01-26 CRAN (R 3.4.1)                     
 htmltools       0.3.6      2017-04-28 CRAN (R 3.4.0)                     
 htmlwidgets     0.9        2017-07-10 CRAN (R 3.4.1)                     
 iterators       1.0.8      2015-10-13 CRAN (R 3.4.0)                     
 knitr           1.17       2017-08-10 CRAN (R 3.4.1)                     
 lattice       * 0.20-35    2017-03-25 CRAN (R 3.4.0)                     
 latticeExtra    0.6-28     2016-02-09 CRAN (R 3.4.1)                     
 lazyeval        0.2.0      2016-06-12 CRAN (R 3.4.0)                     
 magrittr      * 1.5        2014-11-22 CRAN (R 3.4.0)                     
 Matrix        * 1.2-9      2017-03-14 CRAN (R 3.4.0)                     
 memoise         1.1.0      2017-04-21 CRAN (R 3.4.0)                     
 methods       * 3.4.0      2017-04-21 local                              
 munsell         0.4.3      2016-02-13 CRAN (R 3.4.0)                     
 nnet            7.3-12     2016-02-02 CRAN (R 3.4.0)                     
 parallel        3.4.0      2017-04-21 local                              
 plyr            1.8.4      2016-06-08 CRAN (R 3.4.0)                     
 R6              2.2.2      2017-06-17 CRAN (R 3.4.1)                     
 RColorBrewer    1.1-2      2014-12-07 CRAN (R 3.4.0)                     
 Rcpp            0.12.12    2017-07-15 CRAN (R 3.4.1)                     
 rlang           0.1.2      2017-08-09 CRAN (R 3.4.1)                     
 rpart           4.1-11     2017-03-13 CRAN (R 3.4.0)                     
 rstudioapi      0.6        2016-06-27 CRAN (R 3.4.0)                     
 scales          0.4.1.9002 2017-07-23 Github (hadley/scales@6db7b6f)     
 splines         3.4.0      2017-04-21 local                              
 stats         * 3.4.0      2017-04-21 local                              
 stringi         1.1.5      2017-04-07 CRAN (R 3.4.0)                     
 stringr         1.2.0      2017-02-18 CRAN (R 3.4.0)                     
 survival      * 2.41-3     2017-04-04 CRAN (R 3.4.0)                     
 testthat      * 1.0.2      2016-04-23 CRAN (R 3.4.0)                     
 tibble          1.3.4      2017-08-22 CRAN (R 3.4.0)                     
 tools           3.4.0      2017-04-21 local                              
 utils         * 3.4.0      2017-04-21 local                              
 withr           2.0.0      2017-07-28 CRAN (R 3.4.1)                     
 xgboost         0.6-4      2017-01-05 CRAN (R 3.4.0)