StefanK2ff / capstone-healthy-skin


Research: Evaluation Metric + Function #44

Open Da-MaRo opened 11 months ago

bjzim commented 11 months ago

Accuracy: This is the ratio of correctly predicted instances to the total instances. It's a good general metric when classes are balanced, but it can be misleading if there's a class imbalance.
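For reference, a minimal sketch of how this could be computed, assuming we use scikit-learn (the labels below are toy placeholders, not project data):

```python
from sklearn.metrics import accuracy_score

# Toy ground-truth and predicted class labels (placeholders, not project data)
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 1, 0, 2, 2]

# Fraction of predictions that exactly match the true label
print("Accuracy:", accuracy_score(y_true, y_pred))
```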

Confusion Matrix: Provides a detailed breakdown of true positive, true negative, false positive, and false negative predictions for each class. It's a starting point for many other metrics.
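A quick sketch with scikit-learn's confusion_matrix (same toy labels as above):

```python
from sklearn.metrics import confusion_matrix

# Rows = actual classes, columns = predicted classes
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 1, 0, 2, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # diagonal = correct predictions, off-diagonal = confusions between classes
```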

Precision, Recall, and F1-Score:

Precision (Positive Predictive Value): The ratio of correctly predicted positive observations to the total predicted positives. It's important when the cost of false positives is high.

Recall (Sensitivity or True Positive Rate): The ratio of correctly predicted positive observations to all actual positives. It's crucial when the cost of false negatives is high, for example, missing a malignant skin lesion.

F1-Score: The harmonic mean of Precision and Recall. It's useful when you want a balance between Precision and Recall.

Area Under the ROC Curve (AUC-ROC): Represents the capability of the model to distinguish between the classes. An AUC close to 1 indicates good class separability. It's especially useful for binary classification problems.
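A sketch of these four metrics on a toy binary example, assuming scikit-learn (1 = malignant, 0 = benign are placeholder labels, and y_score stands in for the model's predicted probabilities):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy binary example: 1 = malignant, 0 = benign (placeholder values)
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred  = [0, 1, 1, 1, 0, 0, 0, 1]                   # hard class predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.3, 0.4, 0.2, 0.7]   # predicted probability of class 1

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))  # uses scores, not hard labels
```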

Area Under the Precision-Recall Curve (AUC-PR): Especially useful when classes are imbalanced. It focuses on the performance with respect to the positive (minority) class.
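A sketch of how this could look, again assuming scikit-learn; average_precision_score is one common way to summarize the precision-recall curve into a single number:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Same toy binary example as above; AUC-PR focuses on the positive (minority) class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.9, 0.3, 0.4, 0.2, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("AUC-PR (average precision):", average_precision_score(y_true, y_score))
```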

Matthews Correlation Coefficient (MCC): It's a balanced metric that works well even when classes are of very different sizes.
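A minimal sketch, assuming scikit-learn (matthews_corrcoef also handles the multiclass case):

```python
from sklearn.metrics import matthews_corrcoef

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 1, 0, 2, 2]

print("MCC:", matthews_corrcoef(y_true, y_pred))  # ranges from -1 to +1, 0 = chance level
```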

Cohen's Kappa: Measures the agreement between predicted and observed categorizations while correcting for chance. It's especially useful when classes are imbalanced.
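A minimal sketch, assuming scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 1, 0, 2, 2]

print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))  # 1 = perfect agreement, 0 = chance
```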

Specific Considerations for Skin Disease Classification:

Class Imbalance: Skin disease datasets might have many examples of common conditions and fewer examples of rare conditions. In such cases, metrics like accuracy can be misleading, and precision, recall, AUC-PR, or MCC might be more appropriate.

Clinical Consequences: Missing a malignant lesion can have serious health implications, so recall (sensitivity) might be particularly important. Similarly, false positives might lead to unnecessary biopsies or treatments, making precision vital.

Multiclass Problems: If there are more than two skin conditions, you might need micro and macro averages for precision, recall, and F1-score.

In conclusion, I think the best evaluation metric often depends on the specific goals and constraints of the skin disease classification task. It seems to be common to consider multiple metrics to get a comprehensive understanding of a model's performance.
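In that spirit, scikit-learn's classification_report bundles per-class precision, recall, and F1 together with macro and weighted averages; a sketch with toy labels and hypothetical class names:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 1, 0, 2, 2]

# Per-class precision, recall, F1 plus macro and weighted averages in one table;
# the class names are placeholders, not the project's actual label set
print(classification_report(y_true, y_pred, target_names=["benign", "malignant", "other"]))
```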

bjzim commented 11 months ago

Macro Average:

Calculation: For each class, calculate the metric independently and then take the average over all classes.

Interpretation: Macro averaging treats all classes equally, regardless of their size (number of instances). It gives equal weight to each class's metric.

Use Cases: Useful when you want to understand the performance of the model for each class, especially when the classes are imbalanced.

Micro Average:

Calculation: Aggregate the contributions of all classes to compute the metric. Essentially, it pools together the true positives, false positives, true negatives, and false negatives of all classes and calculates the metric on the pooled counts.

Interpretation: Micro averaging is dominated by the larger classes. It provides a performance metric based on the total number of correct and incorrect predictions.

Use Cases: Useful when you want a performance metric that reflects the global performance across all instances.
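A small sketch of the difference, assuming scikit-learn and toy labels in which class 2 is deliberately the frequent one:

```python
from sklearn.metrics import f1_score

# Toy multiclass labels (placeholders); class 2 is deliberately more frequent
y_true = [2, 2, 2, 2, 2, 0, 1, 1]
y_pred = [2, 2, 2, 2, 0, 0, 1, 2]

# Macro: average of per-class scores, every class counts equally
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))

# Micro: pooled counts over all classes, dominated by the frequent class
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
```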

Da-MaRo commented 11 months ago

Amazing research on evaluation metrics, Björn! I have a couple of questions for you.

Most publications on image classification of skin lesions trained on accuracy and mostly discussed top-1 accuracy, top-3 accuracy, and comparisons of accuracy against dermatologists. Should we take a similar approach?
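For reference, top-k accuracy is straightforward to compute; a sketch assuming scikit-learn (top_k_accuracy_score expects class probabilities, and the values below are toy placeholders; with our real label set we could use k=3 for top-3 accuracy):

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Toy predicted class probabilities for 4 samples over 3 classes (placeholders)
y_true  = np.array([0, 1, 2, 2])
y_score = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])

print("Top-1 accuracy:", top_k_accuracy_score(y_true, y_score, k=1))
print("Top-2 accuracy:", top_k_accuracy_score(y_true, y_score, k=2))
```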

We definitely shouldn't "dump" the other metrics, and we should keep an eye on how our model performs on sensitivity, for example. How would we implement that in our workflow?
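One possibility I could imagine (just a sketch, assuming scikit-learn; MALIGNANT_CLASSES is a hypothetical placeholder for whichever class indices we decide are critical) would be a small helper we call after every evaluation run:

```python
from sklearn.metrics import recall_score

# Hypothetical: indices of lesion classes we treat as "must not miss"
MALIGNANT_CLASSES = [1, 4]

def log_sensitivity(y_true, y_pred):
    """Print per-class recall (sensitivity) for the critical classes."""
    recalls = recall_score(y_true, y_pred, labels=MALIGNANT_CLASSES, average=None)
    for cls, rec in zip(MALIGNANT_CLASSES, recalls):
        print(f"Recall for class {cls}: {rec:.3f}")
```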

Thoughts on the following quote?

"For example, unlike the binary case where precision and recall alone may not be the best option for performance evaluation, macro and weighted precision and recall scores by themselves can be good choices for multiclass classification"

Should we put our focus here?
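For context, a sketch of how macro and weighted precision/recall can diverge on imbalanced toy labels (assuming scikit-learn):

```python
from sklearn.metrics import precision_score, recall_score

# Toy multiclass labels (placeholders); the gap between macro and weighted scores
# grows when the frequent classes are predicted better than the rare ones
y_true = [2, 2, 2, 2, 2, 0, 1, 1]
y_pred = [2, 2, 2, 2, 0, 0, 1, 2]

for avg in ("macro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f}")
```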