Tandon-A / emotic

PyTorch implementation of Emotic CNN methodology to recognize emotions in images using context information.
MIT License

Problem with inference #4

Closed · ratkhohieu closed this issue 3 years ago

ratkhohieu commented 3 years ago

Hello, I have a question about the thresholds you used. First, I calculate the thresholds at which precision equals recall. Then I compare the model outputs against these thresholds to turn them into arrays of the form [1, 0, ..., 0]. If I continue with this and pass the result to the function "average_precision_score", I get a different mean AP. Did I do any of the steps wrong?
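To be concrete, this is roughly what I do per category (a sketch with made-up names like cat_labels / cat_scores, not the code from this repository), using scikit-learn's precision_recall_curve:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def find_pr_equal_thresholds(cat_labels, cat_scores):
        # cat_labels, cat_scores: arrays of shape (num_samples, num_categories)
        num_categories = cat_labels.shape[1]
        thresholds = np.zeros(num_categories)
        for c in range(num_categories):
            p, r, t = precision_recall_curve(cat_labels[:, c], cat_scores[:, c])
            # pick the threshold where precision and recall are closest to equal
            # (the last point of the PR curve has no associated threshold, hence [:-1])
            idx = np.argmin(np.abs(p[:-1] - r[:-1]))
            thresholds[c] = t[idx]
        return thresholds

    # convert raw scores into the [1, 0, ..., 0] form with these thresholds:
    # binary_preds = (cat_scores >= thresholds).astype(int)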

Tandon-A commented 3 years ago

@ratkhohieu Hello,

Thank you for showing interest in my work.

The mean AP (computed over the test set) is a metric to assess the trained model's performance and compare it with other models. During testing, we first compute the model outputs over the test set and then pass the ground truth labels, along with those outputs, to the average_precision_score function. At inference time, we use the calculated thresholds to convert the model output for an image into the final categorical predictions. The threshold calculation and the conversion from confidence scores to categorical predictions are as you have described. Note that at inference time you technically wouldn't have access to the ground truth labels, so you wouldn't be able to call average_precision_score at all.
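In code, the evaluation I am describing looks roughly like this (a sketch with assumed array names, not the exact code in this repository):

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_ap(cat_labels, cat_scores):
        # cat_labels, cat_scores: arrays of shape (num_samples, num_categories),
        # collected over the whole test set (ground truth labels and raw model scores)
        aps = [average_precision_score(cat_labels[:, c], cat_scores[:, c])
               for c in range(cat_labels.shape[1])]
        return float(np.mean(aps))

    def predict_categories(cat_scores, thresholds):
        # inference: no ground truth available, so we just apply the stored per-category
        # thresholds (computed beforehand at the precision == recall point)
        return (cat_scores >= thresholds).astype(int)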

I hope this clarifies your doubt.

Regards, Abhishek

ratkhohieu commented 3 years ago

Hello, Thank you for replying to me.

Sorry, but that is not my point. What I mean is: I predict the final labels of the test set using the thresholds (in [1, 0, ...] form) and compute the mean AP from those. The result is different from what I get when I use the raw probabilities output by the model.

  1. Probabilities->threshold->mAP
  2. Probabilities->mAP

I don't know why they differ. The average_precision_score documentation notes that "This implementation is not interpolated and is different from computing the area under the precision-recall curve with the trapezoidal rule". So maybe calculating the thresholds where precision equals recall produces different results?

Tandon-A commented 3 years ago

Hello,

Sorry, I missed your point.

  1. Probabilities->threshold->mAP
  2. Probabilities->mAP

Of the above two points, the first method of calculating the AP is wrong. The average precision method needs non-thresholded values to calculate the metric: "Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions" (scikit-learn documentation). The only 'correct' AP is the one obtained from the second method.

The average precision method summarizes the precision-recall curve. As described on the scikit-learn webpage, AP is a weighted average of the precisions at a set of thresholds, where the weight for each threshold is the increase in recall relative to the previous threshold. The function defines a set of thresholds, computes the precision and recall at each one, and then combines them into the metric.
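For reference, the exact definition used by scikit-learn is AP = Σ_n (R_n − R_{n−1}) · P_n, where P_n and R_n are the precision and recall at the n-th threshold.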

The average_precision_score method works like this (Pseudocode):

    def AP(labels, model_output_scores):
        # define a set of thresholds (e.g. the distinct score values)
        thresholds = define_thresholds(model_output_scores)
        precisions, recalls = [], []
        for threshold in thresholds:
            predictions = convert_scores_to_preds(model_output_scores, threshold)
            precisions.append(precision_score(labels, predictions))
            recalls.append(recall_score(labels, predictions))
        precisions.append(1)
        recalls.append(0)  # added so that the curve starts at the y-axis
        return getAP(precisions, recalls)  # applies the weighted-average equation to get AP

When you are passing the model confidence scores to the function (Option 2), it calculates the recall values and precision values at different thresholds and then gets this weighted average.

But when you pass the thresholded predictions to the function (Option 1), the recall and precision values stay the same at every threshold. Since the predictions are already of the form [0, 0, 1, ...], changing the threshold doesn't change them, so the counts of true positives, false positives, and false negatives are identical each time, and therefore so are the precision and recall. This leads to an incorrect AP value.
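A quick way to see the difference is to run both options on a small made-up example (the numbers below are arbitrary and only for illustration):

    import numpy as np
    from sklearn.metrics import average_precision_score

    # made-up ground truth and confidence scores for a single category
    labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    scores = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.55, 0.4])

    # Option 2: raw confidence scores -> the 'correct' AP
    ap_from_scores = average_precision_score(labels, scores)

    # Option 1: threshold first (here arbitrarily at 0.5), then compute AP on 0/1 predictions
    preds = (scores >= 0.5).astype(int)
    ap_from_preds = average_precision_score(labels, preds)

    # the two values differ, because the 0/1 predictions collapse the
    # precision-recall curve to a single operating point
    print(ap_from_scores, ap_from_preds)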

I hope this clarifies your doubt.

Regards, Abhishek

ratkhohieu commented 3 years ago

Thank you for the very detailed explanation. It was my fault for not reading the metric documentation carefully. I get the point now.