dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

How to select features based on feature importance using SelectFromModel? #2944

Closed shahlaebrahimi closed 6 years ago

shahlaebrahimi commented 6 years ago

I would appreciate it if you could let me know how to select features based on feature importance using SelectFromModel. I wrote:

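# imports needed by the snippet below
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import (accuracy_score, classification_report, cohen_kappa_score,
                             confusion_matrix, f1_score, make_scorer, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# data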
X = np.array(pd.read_csv('who_X_1.csv',header=None).values)
y = np.array(pd.read_csv('who_Y_1.csv',header=None).values.ravel())
indices = np.arange(y.shape[0])

# # Divide Data into Train and Test
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(X, y, indices, stratify=y, test_size=0.3, random_state=42)

scaler = StandardScaler()

# # Compute Cohen's Kappa or Auc as scoring criterion due to imbalanced data set
kappa_scorer = make_scorer(cohen_kappa_score)
auc_scorer=make_scorer(roc_auc_score)
F_measure_scorer = make_scorer(f1_score)

##hyperparameter

param_grid = {
    'clf__colsample_bytree': [i/10.0 for i in range(7,10)],
    #"clf__subsample"  : [i/10.0 for i in range(5,10)],
    #'clf__max_depth':range(5,15,1),
    #'clf__gamma':[i/10.0 for i in range(0,5)],
    #'clf__reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]

            }

##Classifier
xg=XGBClassifier(max_depth=3,
                 learning_rate=0.05,
                 n_estimators=350,
                 objective="binary:logistic",
                 booster="gbtree",
                 gamma=0,
                 min_child_weight=0.8,
                 subsample=1,
                 colsample_bylevel=1,
                 colsample_bytree=0.6,
                 reg_alpha=0.001,
                 reg_lambda=1,
                 scale_pos_weight=22,
                 random_state=4,n_jobs=-1)
pipe=Pipeline(steps=[('pre',scaler),
                    ('clf',xg)])

rg_cv = GridSearchCV(pipe, param_grid, cv=5, scoring = 'f1')
rg_cv.fit(X_train, y_train)
print("Tuned rf best params: {}".format(rg_cv.best_params_))

# Use SelectFromModel
thresholds = np.sort(rg_cv.best_estimator_.named_steps["clf"].feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(rg_cv, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)

    # train model
    selection_model = rg_cv
    selection_model.fit(select_X_train, y_train)

    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))

However, the following error occurred:

ValueError: The underlying estimator GridSearchCV has no `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to SelectFromModel or call fit before calling transform.

Thanks

pommedeterresautee commented 6 years ago

This seems to be a scikit-learn error rather than an XGBoost one. I am closing the issue.
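
For readers who land here with the same ValueError: the message says that SelectFromModel was handed the GridSearchCV object itself, which exposes neither `coef_` nor `feature_importances_`. The fitted classifier inside `best_estimator_` does expose `feature_importances_`, so one possible workaround is to pass that step to SelectFromModel and to refit a fresh clone of the best pipeline on each reduced feature set. The following is only a minimal sketch under stated assumptions: the data is synthetic (a stand-in for the CSV files in the question), the grid and model sizes are shrunk to keep it quick, and the alias `BoostedClassifier` plus the fallback to scikit-learn's GradientBoostingClassifier when xgboost is not installed are illustrative choices, not part of the original post.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier as BoostedClassifier
except ImportError:
    # Assumption: fall back to a scikit-learn booster so the sketch still runs
    from sklearn.ensemble import GradientBoostingClassifier as BoostedClassifier

# Synthetic data, standing in for who_X_1.csv / who_Y_1.csv (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

pipe = Pipeline(steps=[
    ("pre", StandardScaler()),
    ("clf", BoostedClassifier(n_estimators=30, max_depth=3, random_state=4)),
])
param_grid = {"clf__subsample": [0.8, 1.0]}  # deliberately tiny grid

rg_cv = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
rg_cv.fit(X_train, y_train)

# The fitted classifier lives inside the best pipeline, not on GridSearchCV
best_pipe = rg_cv.best_estimator_
fitted_clf = best_pipe.named_steps["clf"]

results = []
for thresh in np.unique(fitted_clf.feature_importances_):
    # Pass the fitted classifier (not rg_cv) so feature_importances_ is found
    selection = SelectFromModel(fitted_clf, threshold=float(thresh), prefit=True)
    select_X_train = selection.transform(X_train)
    select_X_test = selection.transform(X_test)

    # Refit an unfitted copy of the best pipeline on the reduced feature set
    selection_model = clone(best_pipe)
    selection_model.fit(select_X_train, y_train)

    score = f1_score(y_test, selection_model.predict(select_X_test))
    results.append((float(thresh), select_X_train.shape[1], score))
    print("Thresh=%.3f, n=%d, F1: %.3f" % (thresh, select_X_train.shape[1], score))
```

Cloning the best pipeline for every threshold avoids overwriting `rg_cv` inside the loop, which the snippet in the question does when it assigns `selection_model = rg_cv` and refits it. Note that the importances here come from a model trained on standardized inputs; since SelectFromModel only picks columns, applying it to the unscaled arrays and letting the cloned pipeline rescale them is consistent with that.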