The algorithm itself is not necessary. Instead, you need the predicted probabilities generated by the trained algorithm and the true labels. Say you have trained a model and generated predictions:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)[:,1]
Now you can use the ThreshTuner to get the best threshold for your desired loss function, e.g. f1_score
from verstack import ThreshTuner
from sklearn.metrics import f1_score
tuner = ThreshTuner()
tuner.fit(y_test, pred, loss_func = f1_score)
If no loss function is provided, it will use sklearn.metrics.balanced_accuracy_score.
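For instance, a minimal sketch relying on that default (the tuner_default name is just for illustration):
from verstack import ThreshTuner
tuner_default = ThreshTuner()
tuner_default.fit(y_test, pred)  # no loss_func given -> balanced_accuracy_score is used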
Return all the threshold-tuning results as a pd.DataFrame:
tuner.result
Get only the best score based on your loss function, and the associated threshold:
tuner.best_score()
tuner.best_score()['threshold']
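To actually apply the selected threshold to the predicted probabilities, a minimal sketch (np.ravel guards against the threshold coming back as a one-element structure rather than a plain float):
import numpy as np
# threshold associated with the best score
best_threshold = np.ravel(tuner.best_score()['threshold'])[0]
# turn probabilities of class 1 into hard class predictions
final_pred = (pred > best_threshold).astype(int)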
Return the best threshold and score while preserving the distribution of classes. This is important because with highly unbalanced target distributions you might get a perfect score (low error) at a certain threshold, yet it will not help predict the minority class. E.g. if you have 99% zeros and 1% ones, you can tune the threshold to achieve 99% accuracy, but it will not capture any of the actual minority class. So ThreshTuner provides a method that returns a threshold and the associated score while preserving the actual target class distribution:
tuner.best_predict_ratio()
>>> threshold f1_score fraction_of_1
>>> 66 0.454298 0.975806 0.41196 # <- the fraction of predicted ones at this threshold
And you can check the actual fraction of ones in the ground truth labels, just to double-check:
tuner.labels_fraction_of_1
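As a quick sanity check, the same fraction can be computed directly from the labels (a tiny illustration, not a verstack method):
import numpy as np
# fraction of positive labels in the ground truth
actual_fraction_of_1 = np.mean(np.asarray(y_test) == 1)
print(actual_fraction_of_1)  # should match tuner.labels_fraction_of_1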
I am using the function below to check the changes in metrics, but it keeps giving me the same results regardless of the given threshold:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

def model_performance_classification_sklearn(model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: probability cut-off for the positive class
    """
    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    AUC = roc_auc_score(target, pred)  # to compute AUC-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "AUC": AUC,
        },
        index=[0],
    )
    return df_perf
# Calculating different metrics on validation set
LogisticRegression_random_val = model_performance_classification_sklearn(
    LogisticRegression_tuned, X_test, y_test, threshold=0.139
)
print("Validation performance:")
LogisticRegression_random_val
Is it my function, or does it come from ThreshTuner?
Okay, in this line: pred_temp = model.predict(predictors) > threshold
you should be predicting probabilities. If you check the outcome of model.predict(predictors), you will see classes ([0,0,...,1,0,1]), but you need the probabilities of the positive class, which you will then split into predicted classes based on the threshold.
model.predict_proba(predictors)[:,1]
will give you the predicted probabilities of class 1. So modify the line as follows:
pred_temp = model.predict_proba(predictors)[:,1] > threshold
and check again.
This is basically, in part, replicating what ThreshTuner does. You can use ThreshTuner to achieve the same goals.
AUC = roc_auc_score(target, pred) # to compute AUC-score
One more thing: you should not be measuring AUC on the predicted classes. Instead, feed the predicted probabilities to the AUC calculation function. P.S. the outcome does not depend on any threshold; AUC measures the model's overall predictive power across all possible thresholds.
Thank you for the clarifications. If I understood correctly, then AUC should be calculated as AUC = roc_auc_score(target, model.predict_proba(predictors)[:,1])?
Right. You might want to save your predictions into a variable first so that you can use them later, but it really makes no difference for the metric calculation.
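For reference, a minimal sketch of the function with both fixes applied (probabilities computed once, thresholded for the class-based metrics, and passed directly to roc_auc_score); the structure otherwise mirrors the original:
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

def model_performance_classification_sklearn(model, predictors, target, threshold=0.5):
    """Compute classification metrics at a given probability threshold."""
    # predicted probabilities of the positive class
    proba = model.predict_proba(predictors)[:, 1]
    # convert probabilities into hard classes using the threshold
    pred = (proba > threshold).astype(int)
    # AUC is computed on probabilities and therefore does not depend on the threshold
    df_perf = pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
            "AUC": roc_auc_score(target, proba),
        },
        index=[0],
    )
    return df_perf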
Did you get your function working properly?
Yes, the function now works perfectly well. Now I'm thinking about how to write a function to display the confusion matrix for a given threshold. Again, thank you. I really appreciate your help.
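For the confusion matrix, a minimal sketch along the same lines, assuming sklearn's confusion_matrix (the helper name is just illustrative):
from sklearn.metrics import confusion_matrix

def confusion_matrix_at_threshold(model, predictors, target, threshold=0.5):
    """Confusion matrix for class predictions obtained at a given threshold."""
    proba = model.predict_proba(predictors)[:, 1]
    pred = (proba > threshold).astype(int)
    return confusion_matrix(target, pred)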
I was wondering, does Verstack have tools that can help with ANNs and CNNs? I find it very hard to tune neural networks.
In the neural network scope, keras is the go-to package for building a neural network without a lot of pain. There is really nothing to add to this beautiful library... It is a bit overwhelming, as there are a lot of methods/modes/settings, but it is definitely much more user-friendly than creating a NN with plain tensorflow. So I would advise the Andrew Ng Deep Learning Specialisation courses, which will get you up and running with NNs.
Noted. If one has imbalanced data, would you advise adjusting the threshold or applying undersampling and oversampling techniques? Or should we try everything and keep what works best?
This depends on the actual real-world data distribution. If the real-world production data is naturally skewed and the model is trained on a perfect balance, this may result in poor generalisation. So the answer is a sophisticated model/ensemble of models plus threshold tuning.
On the contrary, if real-life data is somewhat balanced and your training data for some reason is unbalanced, then of course you should balance your training data.
Does ThreshTuner allow setting a target value for the loss function and returning the threshold corresponding to it? For example, asking it to give me a threshold that will achieve 95% recall.
You would either have to study the tuner.result dataframe and select the appropriate threshold corresponding to the metric value next to it, or you can use tuner.best_score() to grab the threshold that yields the best score measured by the passed loss function.
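For example, to look up a threshold that reaches at least 95% recall, you could fit a ThreshTuner with recall_score and filter its result dataframe by hand (a sketch; the metric column name is assumed to follow the loss function's name, as in the f1_score example above):
from sklearn.metrics import recall_score
from verstack import ThreshTuner

tuner = ThreshTuner()
tuner.fit(y_test, pred, loss_func = recall_score)
# keep only the thresholds whose recall is at least 0.95
candidates = tuner.result[tuner.result['recall_score'] >= 0.95]
# the largest such threshold predicts the fewest positives while still reaching 95% recall
print(candidates['threshold'].max())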
Thank you for the insightful discussion
%%time
tuner = LGBMTuner(metric = 'recall',seed=999,trials=2000)
tuner.fit(X_train_over, y_train_over)
pred = tuner.predict_proba(X_test)[:,1]
tuner1 = ThreshTuner()
tuner1.fit(y_test, pred, loss_func = recall_score)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [304], in <cell line: 1>()
----> 1 pred = tuner.predict_proba(X_test)[:,1]
2 tuner = ThreshTuner()
3 tuner.fit(y_test, pred, loss_func = recall_score)
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Can you please explain the above error and how to solve it?
Hi, this looks like you are passing a different version of the features to the predict function. That is, the X_train_over that you used at .fit is different from the X_test that you use at predict.
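Regardless of the root cause, the traceback shows that predict_proba returned a 1-dimensional array in this run, so a defensive way to extract the positive-class probabilities is to check the shape first (a small illustration, assuming a 1-D output already holds the class-1 probabilities):
proba = tuner.predict_proba(X_test)
# a 2-D (n_samples, 2) output needs the [:, 1] slice; a 1-D output is used as-is
pred = proba[:, 1] if proba.ndim == 2 else proba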
Hello Danil,
Can you please give more clarification on how to use ThreshTuner, or maybe direct me to a working example?
From the documentation, apart from setting the min and max threshold, metric function, ..., I don't see where to put the algorithm or the data used.