dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

An error message is printed out with "silent=True" in XGBClassifier #2349

Closed weixuanfu closed 6 years ago

weixuanfu commented 7 years ago

The silent parameter in xgboost's Python API fails to prevent XGBClassifier from printing an error message. The error occurs when feature-selection steps leave no features before XGBClassifier is used, which can easily happen during grid search or other parameter-tuning methods.

Environment info

Operating System:

macOS 10.12.5

Compiler:

gcc-6 (Homebrew GCC 6.3.0_1 --without-multilib) 6.3.0

Package used (python/R/jvm/C++):

xgboost version used: 0.60 and 0.60a2

If you are using the Python package, please provide:

  1. The Python version and distribution: Python 3.6.0 |Anaconda custom (x86_64)| and Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 12:15:08)

Steps to reproduce

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from xgboost.core import XGBoostError
import warnings

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=10,
                           random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)

# Build a feature matrix with zero columns, as if feature selection removed every feature
X_empty = np.empty(shape=(len(y_train), 0))

clf = XGBClassifier(silent=True)
try:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        clf.fit(X_empty, y_train)
except XGBoostError:
    print('XGBoostError is caught')

Or

print("importing modules...")
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
from xgboost.core import XGBoostError
from random import randint
import warnings

# I wanted the label data to be a bit imbalanced
print("creating fake data...")
np.random.seed(1776)
df = pd.DataFrame(np.random.randn(8000,11), columns=list("ABCDEFGHIJK"))
label = np.array([randint(1,11) for mynumber in range(0, 8000)])
label[label <= 9] = 0
label[label >= 10] = 1
print(label)
df['label'] = label

# extract labels and drop them from the DataFrame
y = df['label'].values
colsToDrop = ['label']
xdf = df.drop(colsToDrop, axis=1)

x_train, x_test, y_train, y_test = train_test_split(xdf, y, train_size=0.7, test_size=0.3, random_state=1776)

test_pipeline = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(max_features=0.2, random_state=42), threshold=0.3),
    XGBClassifier(learning_rate=0.1, max_depth=1, min_child_weight=13, nthread=1, subsample=0.6, silent=True)
    )

try:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        test_pipeline.fit(x_train, y_train)
except XGBoostError:
    print('XGBoostError is caught')

Output:


[15:59:14] dmlc-core/include/dmlc/logging.h:300: [15:59:14] src/tree/updater_colmaker.cc:162: Check failed: n > 0U (0 vs. 0) colsample_bytree=1 is too small that no feature can be included
ghost commented 7 years ago

There is an option DMLC_LOG_BEFORE_THROW set in include/dmlc/base.h. It is used in dmlc-core/include/dmlc/logging.h to determine whether LogMessageFatal writes to stderr. The easiest option is to set it to zero to prevent this behaviour.

In order for the silent option in Python to suppress these messages, we need to find some way to pass it into the logging init function, possibly then setting glog's FLAGS_logtostderr = 0.
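
Until silent is plumbed through to the logging layer, a possible workaround on the Python side is to redirect the process-level stderr file descriptor around the fit call. The message is written by the C++ layer directly to file descriptor 2, which is why warnings.catch_warnings() in the reproduction scripts above cannot intercept it. A minimal sketch, reusing clf, X_empty and y_train from the first script (the suppress_stderr helper is illustrative, not part of xgboost):

import os
from contextlib import contextmanager

@contextmanager
def suppress_stderr():
    # Save a copy of the real stderr so it can be restored afterwards.
    saved_fd = os.dup(2)
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        # Point fd 2 at /dev/null; this silences C/C++ output as well.
        os.dup2(devnull_fd, 2)
        yield
    finally:
        # Restore the original stderr and close the temporary descriptors.
        os.dup2(saved_fd, 2)
        os.close(saved_fd)
        os.close(devnull_fd)

try:
    with suppress_stderr():
        clf.fit(X_empty, y_train)
except XGBoostError:
    print('XGBoostError is caught')

Note that this hides everything written to stderr while fit runs, including legitimate diagnostics, so it is a stopgap rather than a substitute for honouring silent.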