DanilZherebtsov / verstack


ThreshTuner #24

Closed chrissny88 closed 1 year ago

chrissny88 commented 1 year ago

Hello Danil,

Can you please give more clarification on how to use ThreshTuner, or maybe direct me to a working example?

From the documentation, apart from setting the min and max threshold, the metric function, etc., I don't see where to pass the algorithm or the data used:

thresh = ThreshTuner(n_thresholds = 500, min_threshold = 0.2, max_threshold = 0.6)
thresh.fit(labels, pred, f1_score)
DanilZherebtsov commented 1 year ago

The algorithm itself is not necessary. Instead, you need the predicted probabilities generated by the trained algorithm, along with the labels. Say you have trained and predicted with some algorithm:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)[:,1]

Now you can use ThreshTuner to get the best threshold for your desired loss function, e.g. f1_score:

from verstack import ThreshTuner
from sklearn.metrics import f1_score
tuner = ThreshTuner()
tuner.fit(y_test, pred, loss_func = f1_score)

If no loss function is provided, it will default to sklearn.metrics.balanced_accuracy_score.

Return all the threshold tuning results in a pd.DataFrame: tuner.result

Get the best score based on your loss function, and the associated threshold: tuner.best_score() and tuner.best_score()['threshold']
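For example (a minimal sketch continuing the snippet above; the exact values depend on your data):

tuner.result                        # full DataFrame of evaluated thresholds and their f1_score
best = tuner.best_score()           # the entry with the best f1_score
best_threshold = best['threshold']  # the threshold that produced it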

Return the best threshold & score while preserving the distribution of classes. This is important because with a highly unbalanced target distribution you might get a near-perfect score (low error) at a certain threshold, yet it will not help predict the minority class. E.g. you have 99% zeros and 1% ones: you can tune the threshold to achieve 99% accuracy, but it will not capture any of the actual minority class. So ThreshTuner provides a method to return a threshold and the associated score while preserving the actual target class distribution:

tuner.best_predict_ratio()
>>>         threshold  f1_score  fraction_of_1
>>>    66   0.454298  0.975806        0.41196  # <- is the fraction of predicted ones using this threshold

And you can check the actual fraction of ones in the ground-truth labels, just to double-check: tuner.labels_fraction_of_1
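For instance, to compare the predicted and actual class balance side by side (assuming best_predict_ratio() returns the one-row frame shown above):

tuner.best_predict_ratio()['fraction_of_1']  # fraction of ones predicted at the chosen threshold
tuner.labels_fraction_of_1                   # actual fraction of ones in y_test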

chrissny88 commented 1 year ago

I am using the function below to check the changes in metrics, but it keeps giving me the same results regardless of the given threshold:

import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    roc_auc_score,
)

# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: probability cut-off used to assign the positive class
    """

    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    AUC = roc_auc_score(target, pred)  # to compute AUC-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "AUC":AUC

        },
        index=[0],
    )

    return df_perf

# Calculating different metrics on validation set
LogisticRegression_random_val = model_performance_classification_sklearn(
    LogisticRegression_tuned, X_test, y_test,threshold=0.139
)
print("Validation performance:")
LogisticRegression_random_val

Is the issue in my function, or does it come from ThreshTuner?

DanilZherebtsov commented 1 year ago

Okay, in this line: pred_temp = model.predict(predictors) > threshold, you should be predicting probabilities. If you check the outcome of model.predict(predictors), you will see classes ([0,0,...,1,0,1]), but you need the probabilities of the positive class, which you will then split into the predicted classes based on the threshold.

model.predict_proba(predictors)[:,1] will give you the predicted probabilities of class 1

So modify this line as follows: pred_temp = model.predict_proba(predictors)[:,1] > threshold and check again

This is basically replicating, in part, what ThreshTuner does; you can use ThreshTuner to achieve the same goal.
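For reference, the prediction step inside your function would then look something like this (a minimal sketch of the change described above; the rest of the function stays as it is):

# checking which predicted probabilities of class 1 are greater than the threshold
pred_temp = model.predict_proba(predictors)[:, 1] > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)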

DanilZherebtsov commented 1 year ago

AUC = roc_auc_score(target, pred) # to compute AUC-score

One more thing: you should not be measuring AUC on the predicted classes. Instead, feed the predicted probabilities to the AUC calculation function. P.S. this outcome does not depend on any threshold; AUC measures the model's overall predictive power across all possible thresholds.

chrissny88 commented 1 year ago

Thank you for the clarifications. If I understood correctly, AUC should be calculated as AUC = roc_auc_score(target, model.predict_proba(predictors)[:,1])?

DanilZherebtsov commented 1 year ago

Right. You might want to save your predicted probabilities into a variable first so that you can reuse them later, but it really makes no difference for the metric calculation.
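For example (a small sketch of that pattern, reusing the names from your function):

pred_proba = model.predict_proba(predictors)[:,1]  # saved once, can be reused for other metrics
AUC = roc_auc_score(target, pred_proba)            # AUC computed on the probabilities, not the classes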

DanilZherebtsov commented 1 year ago

Did you get your function working properly?

chrissny88 commented 1 year ago

Yes, the function now works perfectly well. Again, thank you, I really appreciate your help. I am now thinking about how to write a function to display the confusion matrix at a given threshold.
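Maybe something along these lines (a rough sketch using sklearn's confusion_matrix; the function name is just illustrative):

from sklearn.metrics import confusion_matrix

def confusion_matrix_at_threshold(model, predictors, target, threshold=0.5):
    # predicted probabilities of the positive class, split into classes by the threshold
    pred = (model.predict_proba(predictors)[:, 1] > threshold).astype(int)
    return confusion_matrix(target, pred)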

I was wondering, does verstack have tools that can help with ANNs and CNNs? I find it very hard to tune neural networks.

DanilZherebtsov commented 1 year ago

In the neural network space, keras is the go-to package for building a neural network without a lot of pain. There is really nothing to add to this beautiful library... It can be a bit overwhelming, since there are a lot of methods/modes/settings, but it is definitely much more user-friendly than creating a NN in plain tensorflow. So I would advise Andrew Ng's Deep Learning Specialization courses, which will get you up and running with NNs.

chrissny88 commented 1 year ago

Noted. In case one has imbalanced data, would you advise adjusting the threshold or applying undersampling/oversampling techniques? Or should we try everything and keep what works best?

DanilZherebtsov commented 1 year ago

This depends on the actual real-world data distribution. If the real-world production data is naturally skewed and the model is trained on a perfectly balanced sample, this may result in poor generalisation. In that case the answer is a sophisticated model (or an ensemble of models) plus threshold tuning.

On the contrary, if real-life data is somewhat balanced and your training data for some reason is unbalanced, then of course you should balance your training data.

chrissny88 commented 1 year ago

Does ThreshTuner allow setting a target value for the loss function and returning the threshold corresponding to it? For example, asking it to give me a threshold that will achieve 95% recall.

DanilZherebtsov commented 1 year ago

You would either have to study the tuner.result dataframe and select the appropriate threshold corresponding to the metric value next to it, or you can use tuner.best_score() to grab the threshold that yields the best score as measured by the passed loss function.
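For illustration, a minimal sketch of the first approach, assuming the score column in tuner.result is named after the passed loss function (it shows up as f1_score in the example above, so recall_score here):

# keep only the thresholds that reach at least 95% recall
candidates = tuner.result[tuner.result['recall_score'] >= 0.95]
# e.g. pick the highest such threshold to avoid over-predicting the positive class
chosen_threshold = candidates['threshold'].max()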

chrissny88 commented 1 year ago

Thank you for the insightful discussion

chrissny88 commented 1 year ago
%%time

tuner = LGBMTuner(metric = 'recall',seed=999,trials=2000)
tuner.fit(X_train_over, y_train_over)

pred = tuner.predict_proba(X_test)[:,1]
tuner1 = ThreshTuner()
tuner1.fit(y_test, pred, loss_func = recall_score)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [304], in <cell line: 1>()
----> 1 pred = tuner.predict_proba(X_test)[:,1]
      2 tuner = ThreshTuner()
      3 tuner.fit(y_test, pred, loss_func = recall_score)

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Can you please explain the above error and how to solve it?

DanilZherebtsov commented 1 year ago

Hi, this looks like you are passing a different set of features to the predict function. That is, the X_train_over that you used at .fit is different from the X_test that you use at predict.
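A quick way to check this (a sketch, assuming both are pandas DataFrames):

# columns present in one feature set but not the other; this should come out empty
print(set(X_train_over.columns) ^ set(X_test.columns))
# the number of features should also match
print(X_train_over.shape[1], X_test.shape[1])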