Dear Thibault,
regarding your side question: both values are first computed as TP and FP counts across the entire dataset (nothing is computed per sample). The TP count is then "normalized" by the total number of ground-truth objects (i.e. classical sensitivity at the object level), and the FP count is normalized by the number of images.
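For intuition, here is a minimal sketch of how a single FROC operating point is assembled from these counts; the function and argument names are illustrative, not the actual implementation in this repository:

```python
# Minimal sketch of the two FROC axes as described above.
# Names and structure are illustrative, not the actual implementation.

def froc_point(num_tp, num_fp, num_gt_objects, num_images):
    """Compute one FROC operating point at a fixed score threshold.

    num_tp         -- true positives counted across the whole dataset
    num_fp         -- false positives counted across the whole dataset
    num_gt_objects -- total number of ground-truth lesions in the dataset
    num_images     -- total number of images (scans) in the dataset
    """
    sensitivity = num_tp / num_gt_objects  # y-axis: object-level sensitivity
    fp_per_image = num_fp / num_images     # x-axis: FP averaged over images
    return fp_per_image, sensitivity
```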
The behaviour when adding negative patients will change depending on the type of problem you are looking at and what kind of negative patient images are added.
Some thoughts:

1) "... can only add false positives (FP) and nothing else (there are no true negatives in detection), which cannot increase the sensitivity (Se) and therefore biases it toward low values at a fixed FP/scan." The general thought is correct: sensitivity won't increase, but the false positives are computed as an average across the images. Adding negative images will add false positives, but it simultaneously also increases the allowed total number of FP in the dataset (example: 5 images with 10 FP total across all images => 2 FP/image; 10 images with 20 FP total across all images => still 2 FP/image; see the worked example after this list).
2) The exact behaviour will change based on your detection problem. I have observed quite often (but not always) that networks are able to differentiate healthy and sick patients quite effectively, producing few FPs on the healthy ones. This would increase your FROC score, since you increase the total number of allowed FPs in the dataset without adding FPs from the detector. (This is generally indicated by problems where the AP is rather low but the FROC scores are high.)
3) Your thoughts are correct for AP: AP only has TP and FP at the object level, and thus AP can only decrease when more negative patients are added.
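To make points 1 and 2 concrete, a small worked example with made-up numbers:

```python
# Illustrative arithmetic for points 1 and 2 above (made-up numbers).

# Point 1: FP are averaged over images, so the FP/image budget scales
# with the dataset size.
fp_per_image = 10 / 5    # 5 images, 10 FP total  -> 2.0 FP/image
fp_per_image = 20 / 10   # 10 images, 20 FP total -> still 2.0 FP/image

# Point 2: if the detector produces (almost) no FP on the added
# negative images, the average drops instead.
fp_per_image = 10 / 10   # same 10 FP over 10 images -> 1.0 FP/image

# At a fixed operating point of 2 FP/image, the total FP budget grows
# from 2 * 5 = 10 to 2 * 10 = 20, so a lower score threshold (and hence
# a higher sensitivity) fits within the budget -> FROC can increase.
```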
I highly recommend the Metrics Reloaded and metrics-pitfalls papers to build intuition for these things.
In the end, the best way to investigate this more closely is to look at your images and the predictions :)
Dear Michael, thank you for your thorough answer.
> regarding your side question: both values are first computed as TP and FP counts across the entire dataset (nothing is computed per sample). The TP count is then "normalized" by the total number of ground-truth objects (i.e. classical sensitivity at the object level), and the FP count is normalized by the number of images.
OK, thank you for the clarification; it matches what I thought.
> The behaviour when adding negative patients will change depending on the type of problem you are looking at and what kind of negative patient images are added.
> Some thoughts:
>
> 1) "... can only add false positives (FP) and nothing else (there are no true negatives in detection), which cannot increase the sensitivity (Se) and therefore biases it toward low values at a fixed FP/scan." The general thought is correct: sensitivity won't increase, but the false positives are computed as an average across the images. Adding negative images will add false positives, but it simultaneously also increases the allowed total number of FP in the dataset (example: 5 images with 10 FP total across all images => 2 FP/image; 10 images with 20 FP total across all images => still 2 FP/image).
> 2) The exact behaviour will change based on your detection problem. I have observed quite often (but not always) that networks are able to differentiate healthy and sick patients quite effectively, producing few FPs on the healthy ones. This would increase your FROC score, since you increase the total number of allowed FPs in the dataset without adding FPs from the detector. (This is generally indicated by problems where the AP is rather low but the FROC scores are high.)
Indeed, you are totally right:
My intuition was strong, but I couldn't come up with any good arguments to defend it, so I ran a small experiment with subgroups.
I made subgroups by varying the proportion of healthy patients and, more generally (less binary/discontinuous), by varying the average number of lesions per scan in the subgroups, always testing the same trained model on these different groups; a sketch of this setup follows below.
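For reference, a hypothetical sketch of that setup; `evaluate_froc` and the case structure are assumptions, not the code I actually ran, the point is only the stratification pattern:

```python
# Hypothetical sketch of the subgroup experiment described above.
# `evaluate_froc` and the case layout are assumptions, not actual code.

import random

def froc_vs_healthy_fraction(positive_cases, healthy_cases,
                             evaluate_froc, fractions, seed=0):
    """Evaluate one fixed trained model on test subsets that differ
    only in their proportion of healthy (lesion-free) patients."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions:  # e.g. [0.0, 0.2, 0.4]
        # number of healthy cases needed to reach the target fraction;
        # assumes frac < 1 and len(healthy_cases) >= n_healthy
        n_healthy = round(frac / (1.0 - frac) * len(positive_cases))
        subset = positive_cases + rng.sample(healthy_cases, n_healthy)
        results[frac] = evaluate_froc(subset)  # same model every time
    return results
```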
The results are clear:
So my intuition was not only incorrect, but the opposite of reality... How can I explain this fallacious reasoning? I think it is due to the fact that, of all the differences between LIDC and LUNA (a subset of it), the one that seemed the most obvious/trivial, the least "subtle", to me was the absence of healthy patients in LUNA. As I observed a difference in scores, I attributed it to that. The fact that apourchot had the same intuition here reinforced mine, and what's more, I obtained better results than on LIDC in cross-validation on a proprietary external test dataset with no healthy patients... Well, intuition can definitely be a trap!
General thoughts:
FROC can be highly volatile with respect to non-obvious confounders, while still being the primary measure of detection performance. Do you agree?
This makes me think that in detection it is even more crucial to compare models and methods on the same datasets, in a paired fashion, than in segmentation or classification, where such comparisons can still be tricky but where the pitfalls are much easier to catch intuitively. Do you agree?
Anyway, thank you very much for all your answers every time! Really appreciate it!
:question: Question
Maybe related to Project-MONAI/tutorials#1582
Hello,
I observe that when adding negative patients (with no lesion to detect) to the test or cross-validation set, the FROC score always decreases. When I exclude these patients from testing or cross-validation of the same trained model, the score is higher...
I tried to explain this to myself by the fact that adding negative patients can only add false positives (FP) and nothing else (there are no true negatives in detection), which cannot increase the sensitivity (Se) and therefore biases it toward low values at a fixed FP/scan. But a colleague challenged this explanation, saying the following:
For example, if we have a score of Se 80% @ 2 FP/scan on average and we add negative healthy patients, Se will remain the same, and we could still expect 2 FP/scan on average...
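To make the two possibilities explicit, some illustrative arithmetic with made-up numbers (these are not my actual results):

```python
# Illustrative arithmetic for the two scenarios (made-up numbers).

n_pos, fp_pos = 50, 100   # 50 positive scans producing 2 FP/scan
n_neg = 50                # healthy scans added to the test set

# Colleague's scenario: negatives also produce ~2 FP/scan on average.
fp_neg = 100
fp_per_scan = (fp_pos + fp_neg) / (n_pos + n_neg)            # -> 2.0, Se unchanged

# Alternative scenario: negatives are harder and produce 5 FP/scan.
fp_neg_hard = 250
fp_per_scan_hard = (fp_pos + fp_neg_hard) / (n_pos + n_neg)  # -> 3.5
# To get back to the 2 FP/scan budget, the score threshold must rise,
# which removes some TP as well -> sensitivity and the FROC score drop.
```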
Do you have any elements of an answer, or an explanation of this phenomenon? Have you observed this as well?
Another side question: the X-axis (FP/scan) is computed at the sample level and then averaged, but what about the Y-axis? Is it computed at the lesion level, aggregated across samples, or averaged as well? Maybe it could help my understanding?
Thank you very much.
Best, Thibault