I would appreciate it if you could let me know how to select features based on feature importance using SelectFromModel. Here is what I wrote:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import (make_scorer, cohen_kappa_score, roc_auc_score, f1_score,
                             accuracy_score, confusion_matrix, classification_report)
from xgboost import XGBClassifier

# Data
X = np.array(pd.read_csv('who_X_1.csv', header=None).values)
y = np.array(pd.read_csv('who_Y_1.csv', header=None).values.ravel())
indices = np.arange(y.shape[0])
# Divide data into train and test sets (stratified because the data set is imbalanced)
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, stratify=y, test_size=0.3, random_state=42)
scaler = StandardScaler()
# Scorers for Cohen's kappa, ROC AUC, and F1 as alternatives to accuracy (imbalanced data set)
kappa_scorer = make_scorer(cohen_kappa_score)
auc_scorer = make_scorer(roc_auc_score)
F_measure_scorer = make_scorer(f1_score)
# Hyperparameter grid (other parameters left commented out for now)
param_grid = {
    'clf__colsample_bytree': [i / 10.0 for i in range(7, 10)],
    # 'clf__subsample': [i / 10.0 for i in range(5, 10)],
    # 'clf__max_depth': range(5, 15, 1),
    # 'clf__gamma': [i / 10.0 for i in range(0, 5)],
    # 'clf__reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
}
# Classifier
xg = XGBClassifier(max_depth=3,
                   learning_rate=0.05,
                   n_estimators=350,
                   objective="binary:logistic",
                   booster="gbtree",
                   gamma=0,
                   min_child_weight=0.8,
                   subsample=1,
                   colsample_bylevel=1,
                   colsample_bytree=0.6,
                   reg_alpha=0.001,
                   reg_lambda=1,
                   scale_pos_weight=22,
                   random_state=4,
                   n_jobs=-1)

pipe = Pipeline(steps=[('pre', scaler),
                       ('clf', xg)])
rg_cv = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
rg_cv.fit(X_train, y_train)
print("Tuned XGB best params: {}".format(rg_cv.best_params_))
# Use SelectFromModel
thresholds = np.sort(rg_cv.best_estimator_.named_steps["clf"].feature_importances_)

for thresh in thresholds:
    # Select features using the current importance threshold
    selection = SelectFromModel(rg_cv, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # Train model on the selected features
    selection_model = rg_cv
    selection_model.fit(select_X_train, y_train)
    # Evaluate model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))
However, the following error occurred:
ValueError: The underlying estimator GridSearchCV has no `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to SelectFromModel or call fit before calling transform.
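From the error message I gather that SelectFromModel wants the fitted estimator itself (something exposing feature_importances_) rather than the GridSearchCV wrapper. Below is a rough sketch of what I think it is asking for, reusing rg_cv, X_train, and thresh from above; the name fitted_clf is just mine, and I am not sure this is the intended usage:

# Sketch only: pull the fitted XGBClassifier step out of the best pipeline
# and pass it to SelectFromModel with prefit=True, as the error message suggests.
fitted_clf = rg_cv.best_estimator_.named_steps["clf"]    # exposes feature_importances_
selection = SelectFromModel(fitted_clf, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)            # keeps only the selected columns

Is that the right way to do it, or should the whole pipeline be handled differently inside the loop?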
Thanks