cerlymarco / shap-hypetune

A python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models.
MIT License

Eval Metric directionality? #21

Closed · ericvoots closed this issue 1 year ago

ericvoots commented 1 year ago

Hi,

If I use a custom metric like the Brier score, where lower is better, does this package support minimizing the eval metric? Or does it try to maximize by default?

Thank You

cerlymarco commented 1 year ago

Hi,

you are looking for the greater_is_better param:

    greater_is_better : bool, default=False
        Effective only when hyperparameters searching.
        Whether the quantity to monitor is a score function,
        meaning high is good, or a loss function, meaning low is good.
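
For example, with a loss like the Brier score you keep greater_is_better=False, so the search selects the trial with the lowest validation value. A minimal sketch, assuming the BoostSearch estimator exported by shaphypetune:

from shaphypetune import BoostSearch
from lightgbm import LGBMClassifier

# greater_is_better=False: the monitored quantity is a loss,
# so the search keeps the parameter set with the lowest value.
model = BoostSearch(
    LGBMClassifier(n_estimators=150, random_state=0),
    param_grid={'learning_rate': [0.2, 0.1]},
    greater_is_better=False
)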

all the best

ericvoots commented 1 year ago

Hmm, I keep getting an error using Brier score loss (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.brier_score_loss.html).

I was able to get it working with the AUC metric fine.

Here is the error and the function:

ValueError: y_prob contains values less than 0.

def BRS(y_hat, dtrain):
    y_true = dtrain.get_label()
    return 'brs', brier_score_loss(y_true, y_hat)

I checked the data and there is a good mixture of both 1s and 0s and nothing else.

cerlymarco commented 1 year ago

your boosting model is simply predicting negative values.
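
If the custom metric is receiving raw margins rather than probabilities, a possible fix (a sketch, not confirmed in this thread) is to map the predictions through a sigmoid before scoring:

import numpy as np
from sklearn.metrics import brier_score_loss

def BRS(y_hat, dtrain):
    y_true = dtrain.get_label()
    # Assumption: y_hat holds raw margins, which can be negative;
    # the sigmoid maps them into (0, 1) as brier_score_loss expects.
    y_prob = 1.0 / (1.0 + np.exp(-y_hat))
    return 'brs', brier_score_loss(y_true, y_prob)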

ericvoots commented 1 year ago

When I checked it directly from the model object, all the probabilities were above 0. I also ran into issues using the balanced accuracy measure. Only AUC seems to work.

cerlymarco commented 1 year ago

Here is a dummy example which works fine... I hope you can find it helpful.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss

from shaphypetune import BoostRFE

from lightgbm import LGBMClassifier

# Synthetic binary classification data.
X, y = make_classification(n_samples=6000, n_features=20, n_classes=2,
                           n_informative=4, n_redundant=6, random_state=0)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, shuffle=False)

# LightGBM sklearn-API custom metric: returns (name, value, is_higher_better).
def BRIER(y_true, y_hat):
    return 'brier', brier_score_loss(y_true, y_hat, pos_label=1), False

param_grid = {
    'learning_rate': [0.2, 0.1],
    'num_leaves': [25, 35],
    'max_depth': [10, 12]
}

model = BoostRFE(
    LGBMClassifier(n_estimators=150, random_state=0, metric="custom"),
    param_grid=param_grid, min_features_to_select=1, step=1,
    greater_is_better=False  # BRIER is a loss, so lower is better
)
model.fit(
    X_train, y_train, 
    eval_set=[(X_valid, y_valid)], early_stopping_rounds=6, verbose=1, 
    eval_metric=BRIER
)

All the best

ericvoots commented 1 year ago

So BoostRFE can be used with classification models too? Most of the examples here show BoostRFE with regression models:

https://github.com/cerlymarco/shap-hypetune/blob/main/notebooks/XGBoost_usage.ipynb

cerlymarco commented 1 year ago

All the estimators available in shap-hypetune can be used for classification and regression, with both XGBoost and LightGBM.
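
For instance, a selection estimator can wrap a regressor just as well. A minimal sketch, assuming the BoostBoruta estimator and its max_iter parameter from the shap-hypetune README:

from shaphypetune import BoostBoruta
from xgboost import XGBRegressor

# Boruta-style feature selection around a regression booster;
# the same fit(..., eval_set=...) pattern shown above applies.
selector = BoostBoruta(XGBRegressor(n_estimators=150, random_state=0), max_iter=100)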

ericvoots commented 1 year ago

Ah, got it. Okay, I'm still getting errors on the Brier score, but I also got this error on balanced accuracy:

raise ValueError("Classification metrics can't handle a mix of {0} "
ValueError: Classification metrics can't handle a mix of binary and continuous targets

In both the original DB and the dataframe created for the target, all values are 0 and 1.

The regular clf_xgb fits fine and can do both Brier and balanced accuracy without issue, but the code crashes on the BoostRFE model (and on Boruta too) at the .fit step. Here is the code:

clf_xgb = XGBClassifier(n_estimators=2000,
                        random_state=0,
                        verbosity=3,
                        n_jobs=-1,
                        scale_pos_weight=1,
                        use_label_encoder=False,
                        objective='binary:logistic',
                        eval_set=[(cv_x, cv_y)])

clf_xgb.fit(train_x, train_y)

class_pred = clf_xgb.predict(train_x)

balanced_accuracy = balanced_accuracy_score(class_pred, train_y)

brier_score = brier_score_loss(class_pred, train_y)

print(brier_score)

print(balanced_accuracy)

model = BoostRFE(clf_xgb, param_grid=param_dist, min_features_to_select=1, step=1, n_iter=8, sampling_seed=0)

model.fit(train_x, train_y, eval_set=[(cv_x, cv_y)], early_stopping_rounds=6, verbose=100, eval_metric=ACC)
print(model.estimator_, model.best_params_, model.best_score_, model.n_features_)

print(f"feature ranking {model.ranking_}")

model_ranking_list = list(model.ranking_)

print(model_ranking_list)

cerlymarco commented 1 year ago

It seems you are not using eval_metric=ACC in the regular clf_xgb.

Pay attention! I think you are passing probabilities (continuous values) to balanced_accuracy_score instead of predicted classes/targets.
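
If so, thresholding the probabilities into hard labels before calling the metric avoids that error. A small illustrative snippet:

import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.9])  # continuous probabilities
class_pred = (y_prob > 0.5).astype(int)  # hard 0/1 labels
print(balanced_accuracy_score(y_true, class_pred))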

ericvoots commented 1 year ago

I was using the balanced accuracy directly with the following and no crashes:

balanced_accuracy = balanced_accuracy_score(class_pred, train_y)

and also when printing the score out there were no crashes. Even when I modify clf_xgb to use the custom accuracy function, like so, there are no errors:

clf_xgb = XGBClassifier(n_estimators=2000,
                        random_state=0,
                        verbosity=3,
                        n_jobs=-1,
                        scale_pos_weight=1,
                        use_label_encoder=False,
                        objective='binary:logistic',
                        eval_set=[(cv_x, cv_y)],
                        eval_metric=ACC)

and I'm able to print both the balanced accuracy score (0.984741888307878) and the Brier score (0.02292) to the console.


cerlymarco commented 1 year ago

Here is another dummy example which works fine... I hope you can find it helpful.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score

from shaphypetune import BoostRFE

from xgboost import XGBClassifier

# Synthetic binary classification data.
X, y = make_classification(n_samples=6000, n_features=20, n_classes=2,
                           n_informative=4, n_redundant=6, random_state=0)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, shuffle=False)

# XGBoost custom metric: receives (y_pred, DMatrix) and returns (name, value);
# reporting 1 - balanced accuracy makes it a loss, so lower is better.
def ACC(y_pred, dtrain):
    y_true = dtrain.get_label()
    y_pred = (y_pred > 0.5).astype(int)
    err = 1 - balanced_accuracy_score(y_true, y_pred)
    return 'bal_acc', err

param_grid = {
    'learning_rate': [0.2, 0.1],
    'num_leaves': [25, 35],
    'max_depth': [10, 12]
}

model = BoostRFE(
    XGBClassifier(n_estimators=150, random_state=0, metric="custom"),
    param_grid=param_grid, min_features_to_select=1, step=1,
    greater_is_better=False  # ACC returns an error, so lower is better
)
model.fit(
    X_train, y_train, 
    eval_set=[(X_valid, y_valid)], early_stopping_rounds=6, verbose=1, 
    eval_metric=ACC
)
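
After fitting, the tuned estimator and search results can be inspected through the fitted attributes used earlier in this thread:

# Fitted attributes set by BoostRFE after .fit (as used above):
print(model.estimator_, model.best_params_, model.best_score_, model.n_features_)
print(f"feature ranking {model.ranking_}")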

Sincerely, this is the best I can do... all the best. Bye!