DearCaat / MHIM-MIL

[ICCV 2023 Oral] Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

A suggestion regarding the performance evaluation #13

Closed · wwyi1828 closed this 6 months ago

wwyi1828 commented 6 months ago

First of all, great job on the project!

I was going through the code and had a thought about the evaluation metric generation in utils.py (https://github.com/DearCaat/MHIM-MIL/blob/a411faabd7732c9c59fff7201ded205f05362d93/utils.py#L97). I noticed that the optimal threshold search is performed directly on the test set in the main file.

If this function is applied directly to the test set to search for the optimal threshold, it causes information leakage. If a threshold is to be used at all, it should be searched for on the validation set and then applied to the test set, rather than searched for on the test set itself. Searching on the test set violates a basic principle of ML evaluation and can inflate accuracy and other threshold-dependent metrics such as the F1 score. Since the same procedure is applied to every method, the comparison between them may not be affected, but it still risks inflating the non-AUC metrics of all models, baselines included.
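To make the suggestion concrete, here is a minimal, leak-free sketch of the procedure I have in mind (illustrative only, not code from this repo; the scikit-learn calls and array names are my own assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_curve

def select_threshold(val_labels, val_probs):
    # Search candidate thresholds on the *validation* split only and keep
    # the one that maximizes F1 there.
    _, _, thresholds = roc_curve(val_labels, val_probs)
    f1s = [f1_score(val_labels, (val_probs >= t).astype(int)) for t in thresholds]
    return float(thresholds[int(np.argmax(f1s))])

def test_metrics(test_labels, test_probs, threshold):
    # Reuse the validation-derived threshold unchanged on the test split.
    preds = (test_probs >= threshold).astype(int)
    return accuracy_score(test_labels, preds), f1_score(test_labels, preds)
```

The key point is simply that the test labels never influence the choice of threshold.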

This is just a minor suggestion to make the evaluation process even more stringent. I hope this suggestion is helpful. Thanks!

DearCaat commented 6 months ago

Many thanks to Wu for this good suggestion. I have added a relevant argument, --best_thr_val, to the repository. It lets the model use the optimal threshold found on the validation set when evaluating the non-AUC metrics on the test set.

Beyond that, I have my own view on the subject of validation sets. Because the amount of data in computational pathology is small compared to the traditional CV domain, I don't see a particular need to keep a validation set on some datasets. A validation set with very little data can sometimes hinder a fair evaluation rather than help it. Borrowing from survival prediction, which uses 5-fold cross-validation without a validation set, might be a better solution, roughly as sketched below.
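The protocol I have in mind looks roughly like this (just a sketch, not the actual pipeline of this repo; `train_and_eval` is a placeholder for one MIL training run):

```python
# 5-fold cross-validation without a validation split: each fold's held-out
# part is used exactly once, as the test set for that fold.
from sklearn.model_selection import StratifiedKFold

def run_5fold(slide_ids, labels, train_and_eval, seed=2021):
    # train_and_eval(train_idx, test_idx) -> dict of metrics (placeholder)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_metrics = []
    for train_idx, test_idx in skf.split(slide_ids, labels):
        fold_metrics.append(train_and_eval(train_idx, test_idx))
    return fold_metrics  # report mean and std over the 5 folds
```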

wwyi1828 commented 6 months ago

Thank you for your quick response and for addressing this issue promptly. It's great that you have already incorporated the --best_thr_val parameter to search for the optimal threshold on the validation set.

I completely agree with your thoughts on cross-validation and small data sets. Cross-validation is indeed an effective approach to simulating real-world scenarios, ensuring that there is no information leakage between folds.

Best wishes for your future endeavors!

DearCaat commented 6 months ago

Best wishes :thumbsup:

wwyi1828 commented 6 months ago

I just wanted to add a brief follow-up to my previous comment to clarify a few points regarding the use of validation sets and cross-validation in the context of the CAMELYON16 dataset.

In C16, when val_ratio==0 the validation and test sets are not kept separate: the same data is used for model selection and for the final evaluation, which can lead to overly optimistic estimates. While that setup might be suitable for hyperparameter tuning, it's not ideal for evaluating model performance on unseen data. When working with small datasets and aiming to use all data for both training and testing, nested cross-validation is a more appropriate approach: it prevents information leakage and gives a more realistic estimate of performance on truly unseen data. Ideally, all metrics should be reported on an independent test set that the model has not seen during training or validation.
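To make the nested idea concrete, a rough sketch (purely illustrative; `fit_predict` and `select_threshold` are placeholders, not functions from this repo):

```python
# Nested CV: the inner loop is used only for threshold/hyperparameter
# selection; each outer held-out fold is touched once, for final metrics.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_cv(labels, fit_predict, select_threshold, n_outer=5, n_inner=5, seed=2021):
    # fit_predict(train_idx, eval_idx) -> predicted probabilities on eval_idx (placeholder)
    labels = np.asarray(labels)
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed)
    results = []
    for train_idx, test_idx in outer.split(np.zeros(len(labels)), labels):
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed)
        thresholds = []
        for in_tr, in_val in inner.split(train_idx, labels[train_idx]):
            val_idx = train_idx[in_val]
            val_probs = fit_predict(train_idx[in_tr], val_idx)
            thresholds.append(select_threshold(labels[val_idx], val_probs))
        threshold = float(np.mean(thresholds))          # aggregate the inner-fold choices
        test_probs = fit_predict(train_idx, test_idx)   # retrain on the full outer train split
        results.append({"y_true": labels[test_idx], "probs": test_probs, "thr": threshold})
    return results
```

This way the data used to pick the threshold never overlaps with the data used to report the final numbers.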

I hope this clarifies my earlier comment. While the C16 setting might not affect the comparative performance between models, it is worth noting because the leakage tends to inflate the performance metrics on C16 for all methods, including the baselines. I appreciate your efforts in addressing this issue, and I hope this additional perspective is helpful. Good luck!