tjvananne opened this issue 7 years ago
Hmm, the warning message was indeed reproduced in my environment. But it is very strange, because the colsample_bytree parameter is not in our operator dictionary, so TPOT would not tune this parameter and would keep its default value of 1.
I also checked the first 26 pipelines in the TPOT optimization process. Only the two pipelines below used XGBClassifier. I tested both of them and both worked without the warning message. Very strange.
```
Pipeline(steps=[('xgbclassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True,
       subsample=0.6500000000000001))])

Pipeline(steps=[('xgbclassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=4,
       min_child_weight=7, missing=None, n_estimators=100, nthread=1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.8))])
```
I should have been a bit more explicit for anyone else following along. In addition to the message in the console, I also receive the Windows message "python.exe has stopped working", and then the program crashes.
Also, I'm not sure whether "pipeline" maps 1-to-1 to "generation" (I haven't dug into much of the source yet), but I did have to set the generations parameter of the TPOTClassifier object sufficiently high to trigger the error. So maybe it isn't one of the XGBClassifiers in the first 26 pipelines, but rather one of the pipelines later on? I might be misunderstanding the relationship between pipelines and generations, though.
In the script above, for example, if I set generations and population_size both to 32, I get this error message, but if I lower both parameters to 30, there is no error. That seems to be roughly where the threshold is for reproducing this issue.

Note: never mind, I just tested with 30 as the value for both generations and population_size and got the same error message, but not until the optimization was 23% done.
The tpot object fit with no errors when generations and population_size were both set to 26.
Going to try to investigate this more tonight.
Thank you for the detailed information on this issue.
I don't think the issue is related to generations, since the error message showed up in the initial generation, which contains only randomly generated pipelines. I suspect it might be due to the _pre_test decorator, because it tests each pipeline on a small dataset to make sure it is a valid pipeline. Some invalid pipelines that include the XGBClassifier operator might cause this issue in _pre_test. I will also run more tests to find the reason.
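For intuition, the idea behind such a pre-test can be sketched as a plain validity check (TPOT's actual _pre_test is a decorator with different internals; the name `is_valid_pipeline` and the tiny-dataset sizes here are my own illustrative assumptions): fit the candidate pipeline on a small random dataset and treat any exception as a sign that the pipeline is invalid.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

def is_valid_pipeline(pipeline, n_samples=50, n_features=10):
    """Fit the pipeline on a tiny random dataset; any exception marks it invalid."""
    rng = np.random.RandomState(42)
    X = rng.randn(n_samples, n_features)
    y = rng.randint(0, 2, n_samples)
    try:
        pipeline.fit(X, y)
        return True
    except Exception:
        return False

# A plain classifier passes the pre-test.
good = make_pipeline(DecisionTreeClassifier())

# A selector whose threshold removes every feature fails it:
# importances over 10 random features are all far below 10.0,
# so the downstream estimator receives zero columns and raises.
bad = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(n_estimators=10), threshold=10.0),
    DecisionTreeClassifier(),
)
print(is_valid_pipeline(good), is_valid_pipeline(bad))
```

A check like this is cheap because the test dataset is tiny, so it can run on every randomly generated pipeline before the expensive cross-validation step.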
Absolutely! Thanks for your response!
That reminds me, it's probably also worth mentioning that I ran into a few "Invalid pipeline encountered. Skipping its evaluation." messages when using verbosity=3 in the TPOTClassifier() constructor.
Thank you!
I found the cause of the issue. It is due to pipeline 32 in generation 0 (see below) when using the demo from this issue. The first step is feature selection, but sadly no feature passed the threshold in that step, so no features are available for XGBClassifier in the second step.
To fix this, I will submit a PR that catches this error message and prevents TPOT from crashing.
```python
# pipeline 32
make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(max_features=0.2), threshold=0.30000000000000004),
    XGBClassifier(learning_rate=0.1, max_depth=1, min_child_weight=13, nthread=1, subsample=0.6000000000000001)
)
```
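The failure mode can be reproduced in isolation, without TPOT or xgboost: with a high enough threshold, SelectFromModel passes a feature matrix with zero columns to the next pipeline step. This is a standalone sketch with made-up data, not TPOT code; the threshold value is deliberately absurd.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = rng.randint(0, 2, 200)

# Feature importances sum to 1.0 across 10 features, so no single
# importance can ever reach a threshold of 10.0 -- everything is dropped.
selector = SelectFromModel(
    estimator=ExtraTreesClassifier(n_estimators=10, random_state=0),
    threshold=10.0,
)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 0): nothing left for the next step
```

Any estimator placed after this selector then receives an input with zero features, which is what triggers the xgboost error in the pipeline above.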
To reproduce the error message without running TPOT, please try the code below:
```python
print("importing modules...")
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import VotingClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
from random import randint
from copy import copy
from tpot import TPOTClassifier

# I wanted the label data to be a bit imbalanced
print("creating fake data...")
np.random.seed(1776)
df = pd.DataFrame(np.random.randn(8000, 11), columns=list("ABCDEFGHIJK"))
label = np.array([randint(1, 11) for mynumber in range(0, 8000)])
label[label <= 9] = 0
label[label >= 10] = 1
print(label)
df['label'] = label

# extract labels and drop them from the DataFrame
y = df['label'].values
colsToDrop = ['label']
xdf = df.drop(colsToDrop, axis=1)
x_train, x_test, y_train, y_test = train_test_split(xdf, y, train_size=0.7, test_size=0.3, random_state=1776)

# make a test pipeline
"""test_pipeline = make_pipeline(
    make_union(VotingClassifier([("est", DecisionTreeClassifier(criterion="gini", max_depth=10, min_samples_leaf=8, min_samples_split=13))]), FunctionTransformer(copy)),
    XGBClassifier(learning_rate=0.01, max_depth=2, min_child_weight=9, nthread=1, subsample=0.1)
)"""
test_pipeline = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(max_features=0.2), threshold=0.30000000000000004),
    XGBClassifier(learning_rate=0.1, max_depth=1, min_child_weight=13, nthread=1, subsample=0.6000000000000001)
)

# Fix random state when the operator allows (optional), just to get a consistent CV score in TPOT
tpot = TPOTClassifier()
tpot._set_param_recursive(test_pipeline.steps, 'random_state', 42)

# cv scores
cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5, scoring='accuracy', verbose=0)

from xgboost.core import XGBoostError
import warnings

for i in range(2000):
    try:
        # cv scores
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5, scoring='accuracy', verbose=0)
    except XGBoostError:
        print("Wrong")
```
Somehow, the error message still showed up even when I used the code above to catch XGBoostError, but the program did not crash after running this bad pipeline 2000 times.
From this part of the xgboost source code, it seems the error message is printed by std::ostringstream. I am not sure whether Python can catch this message.
Ah that makes sense, good catch!
Would it be acceptable to just suppress any pipelines that don't meet certain conditions (in this case, passing on no features because none met the feature-selection threshold), so they don't get scored or crossed over with other pipelines?
I see what you're saying, though: it would probably be best to use XGBoost's built-in error checking, from a maintainability perspective, right?
Thank you for these good ideas.
It is hard to tell ahead of time whether the feature-selection step will remove all features, since that depends on the data. We will refine the parameters in the selectors (#423) to prevent this issue.
In my code above, I tried catching XGBoost's built-in error XGBoostError from its Python wrapper, but the error message was still printed even though the program kept running. I think std::ostringstream in the xgboost C++ source prints the error to stdout. It is very strange.
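When a message is written by native C++ code directly to the process's file descriptors, Python-level tools such as `contextlib.redirect_stderr` cannot capture it, because they only swap out the Python `sys.stderr` object. Redirecting the descriptor itself with `os.dup2` does work. This is a general-purpose sketch of that technique, not something TPOT or xgboost does:

```python
import os
import sys
from contextlib import contextmanager

@contextmanager
def suppress_native_stderr():
    """Silence output written to fd 2 by C/C++ code, not just Python's sys.stderr."""
    sys.stderr.flush()
    saved_fd = os.dup(2)                    # keep a copy of the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)              # point fd 2 at the null device
        yield
    finally:
        os.dup2(saved_fd, 2)                # restore the original stderr
        os.close(saved_fd)
        os.close(devnull_fd)

with suppress_native_stderr():
    # Writing straight to fd 2 mimics what native code does; this vanishes.
    os.write(2, b"this native-level message disappears\n")
```

The same pattern applied to fd 1 would cover messages that the native library sends to stdout instead.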
Do we know why this issue occurs? It would be helpful to know why "colsample_bytree=1 is too small that no feature can be included" happens.
The reason is that a feature-selection step in a pipeline can exclude all features before xgboost runs. We need better control over the number of features within a pipeline.
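Until selector parameters are constrained, a caller can guard against this case by checking how many features survive the selection step before fitting the downstream estimator. This is a sketch of that guard; `get_support()` is the standard scikit-learn selector API, while the threshold value here is deliberately chosen to remove everything.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.randn(100, 8)
y = rng.randint(0, 2, 100)

selector = SelectFromModel(
    estimator=ExtraTreesClassifier(n_estimators=10, random_state=0),
    threshold=5.0,  # deliberately too high: importances sum to 1.0, so none qualifies
)
selector.fit(X, y)

# Count the features that survived selection before running the next step.
n_kept = int(selector.get_support().sum())
if n_kept == 0:
    print("Invalid pipeline: selector removed every feature; skipping.")
else:
    X_sel = selector.transform(X)  # safe to hand to the downstream estimator
```

In a TPOT-style pre-test, a pipeline failing this check could simply be skipped rather than allowed to reach xgboost with an empty feature matrix.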
I have the same issue.
```r
> devtools::session_info()
Session info --------------------------------------------------------------------------------------
 setting  value
 version  R version 3.4.0 (2017-04-21)
 system   x86_64, mingw32
 ui       RStudio (1.0.153)
 language (EN)
 collate  French_France.1252
 tz       Europe/Paris
 date     2017-08-24
Packages ------------------------------------------------------------------------------------------
 package       * version    date       source
 acepack         1.4.1      2016-10-29 CRAN (R 3.4.1)
 backports       1.1.0      2017-05-22 CRAN (R 3.4.0)
 base          * 3.4.0      2017-04-21 local
 base64enc       0.1-3      2015-07-28 CRAN (R 3.4.0)
 bigmemory       4.5.19     2016-03-28 CRAN (R 3.4.1)
 bigmemory.sri   0.1.3      2014-08-18 CRAN (R 3.4.0)
 bigsnpr       * 0.1.0.9001 2017-08-24 local
 bigstatsr     * 0.1.0.9002 2017-08-24 local
 checkmate       1.8.3      2017-07-03 CRAN (R 3.4.1)
 cluster         2.0.6      2017-03-10 CRAN (R 3.4.0)
 codetools       0.2-15     2016-10-05 CRAN (R 3.4.0)
 colorspace      1.3-2      2016-12-14 CRAN (R 3.4.0)
 compiler        3.4.0      2017-04-21 local
 crayon          1.3.2.9000 2017-07-22 Github (gaborcsardi/crayon@750190f)
 data.table      1.10.4     2017-02-01 CRAN (R 3.4.0)
 datasets      * 3.4.0      2017-04-21 local
 devtools        1.13.3     2017-08-02 CRAN (R 3.4.1)
 digest          0.6.12     2017-01-27 CRAN (R 3.4.0)
 foreach       * 1.4.3      2015-10-13 CRAN (R 3.4.0)
 foreign         0.8-67     2016-09-13 CRAN (R 3.4.0)
 Formula       * 1.2-2      2017-07-10 CRAN (R 3.4.1)
 ggplot2       * 2.2.1.9000 2017-07-23 Github (hadley/ggplot2@331977e)
 graphics      * 3.4.0      2017-04-21 local
 grDevices     * 3.4.0      2017-04-21 local
 grid            3.4.0      2017-04-21 local
 gridExtra       2.2.1      2016-02-29 CRAN (R 3.4.0)
 gtable          0.2.0      2016-02-26 CRAN (R 3.4.0)
 Hmisc         * 4.0-3      2017-05-02 CRAN (R 3.4.1)
 htmlTable       1.9        2017-01-26 CRAN (R 3.4.1)
 htmltools       0.3.6      2017-04-28 CRAN (R 3.4.0)
 htmlwidgets     0.9        2017-07-10 CRAN (R 3.4.1)
 iterators       1.0.8      2015-10-13 CRAN (R 3.4.0)
 knitr           1.17       2017-08-10 CRAN (R 3.4.1)
 lattice       * 0.20-35    2017-03-25 CRAN (R 3.4.0)
 latticeExtra    0.6-28     2016-02-09 CRAN (R 3.4.1)
 lazyeval        0.2.0      2016-06-12 CRAN (R 3.4.0)
 magrittr      * 1.5        2014-11-22 CRAN (R 3.4.0)
 Matrix        * 1.2-9      2017-03-14 CRAN (R 3.4.0)
 memoise         1.1.0      2017-04-21 CRAN (R 3.4.0)
 methods       * 3.4.0      2017-04-21 local
 munsell         0.4.3      2016-02-13 CRAN (R 3.4.0)
 nnet            7.3-12     2016-02-02 CRAN (R 3.4.0)
 parallel        3.4.0      2017-04-21 local
 plyr            1.8.4      2016-06-08 CRAN (R 3.4.0)
 R6              2.2.2      2017-06-17 CRAN (R 3.4.1)
 RColorBrewer    1.1-2      2014-12-07 CRAN (R 3.4.0)
 Rcpp            0.12.12    2017-07-15 CRAN (R 3.4.1)
 rlang           0.1.2      2017-08-09 CRAN (R 3.4.1)
 rpart           4.1-11     2017-03-13 CRAN (R 3.4.0)
 rstudioapi      0.6        2016-06-27 CRAN (R 3.4.0)
 scales          0.4.1.9002 2017-07-23 Github (hadley/scales@6db7b6f)
 splines         3.4.0      2017-04-21 local
 stats         * 3.4.0      2017-04-21 local
 stringi         1.1.5      2017-04-07 CRAN (R 3.4.0)
 stringr         1.2.0      2017-02-18 CRAN (R 3.4.0)
 survival      * 2.41-3     2017-04-04 CRAN (R 3.4.0)
 testthat      * 1.0.2      2016-04-23 CRAN (R 3.4.0)
 tibble          1.3.4      2017-08-22 CRAN (R 3.4.0)
 tools           3.4.0      2017-04-21 local
 utils         * 3.4.0      2017-04-21 local
 withr           2.0.0      2017-07-28 CRAN (R 3.4.1)
 xgboost         0.6-4      2017-01-05 CRAN (R 3.4.0)
```
I've seen some traffic on these issues about potentially dropping xgboost altogether due to dependency troubles, so if that happens this isn't relevant.
I am receiving the following error message:
I know that the colsample_bytree parameter should be the proportion of features each tree is allowed to randomly sample from in order to build itself. So colsample_bytree=1 should tell each tree to sample from 100% of the columns/features when building a tree. (Please correct me if I'm wrong on that!)
> xgboost documentation: colsample_bytree = subsample ratio of columns when constructing each tree.
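That definition makes the error message less mysterious: the number of columns a tree can sample is roughly the ratio times the feature count, so when an upstream step leaves zero features, even a ratio of 1.0 yields zero usable columns. This is a back-of-the-envelope sketch of that arithmetic, not xgboost's actual internal code:

```python
def sampled_columns(colsample_bytree, n_features):
    """Approximate number of columns available per tree (illustrative only)."""
    return int(colsample_bytree * n_features)

# With a normal feature matrix, a ratio of 1.0 means every column is available.
print(sampled_columns(1.0, 10))  # 10

# With zero input features, even the maximum ratio produces zero columns,
# which matches "colsample_bytree=1 is too small that no feature can be included".
print(sampled_columns(1.0, 0))   # 0
```

So the complaint about colsample_bytree is a symptom; the actual problem is the empty feature matrix arriving from the earlier pipeline step.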
This has also been raised previously as an issue on xgboost's GitHub repo, but that issue was closed without much explanation of what the user was doing wrong.
My guess is that this is an error in the parameters being passed into XGBoost, and not necessarily an xgboost issue.
Context of the issue
My environment:
Process to reproduce the issue
This is my simple script to reproduce the error in my environment with random data. The error doesn't tend to occur when my generations and population_size are low (around 10-15 each). I have experienced this issue with generations/population_size as low as 32 (with this same script below). Hopefully this short script is sufficiently reproducible!
I couldn't find any prior issues addressing this specific error, but I apologize if I missed one.