MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

[Question] understanding evaluation metrics #152

Closed EvelyneCalista closed 8 months ago

EvelyneCalista commented 1 year ago

:question: Question

Dear @mibaumgartner ,

I have some questions about interpreting the results:

  1. Which folders contain the final results that you used in the paper? I see there are val_analysis, val_analysis_preprocessed, and also sweep, and I am not quite sure about the differences between these three folders.
  2. My labels consist of 2 classes, but the confusion matrix shows classes 0, 1, 2. Does class 0 mean the background? I read through the code but could not find an answer.
  3. The val_analysis folder shows that the model was evaluated with different combinations of IoU and "score threshold". What does the "score threshold" mean? As far as I understood, the threshold is the IoU itself.

Thank you.

Best, Evelyne

mibaumgartner commented 1 year ago

1) From the readme:

The final model directory will contain multiple subfolders with different information:

- sweep: contains information from the parameter sweeps and is only used for debugging purposes.
- sweep_predictions: contains predictions with additional ensembler state information which are used during the empirical parameter optimization. Since these save the model output in a fairly raw format, they are bigger than the predictions seen during normal inference; this avoids running the model multiple times during the parameter sweeps.
- [val/test]_predictions: contains the predictions of the validation/test set in the restored image space.
- val_predictions_preprocessed: contains predictions in the preprocessed image space, i.e. the predictions from the resampled and cropped data. They are saved for debugging purposes.
- [val/test]_results: contains the validation/test results computed by nnDetection. More information on the metrics can be found below.
- val_results_preprocessed: contains validation results inside the preprocessed image space; saved for debugging purposes.
- val_analysis[_preprocessed] (experimental): provides additional analysis information about the predictions. This feature is marked as experimental since it uses a simplified matching algorithm and should only be used to gain an intuition of potential improvements.
The following section contains some additional information regarding the metrics which are computed by nnDetection. They can be found in [val/test]_results/results_boxes.json:

- AP_IoU_0.10_MaxDet_100: the main metric used for the evaluation in our paper. It is evaluated at an IoU threshold of 0.1 with up to 100 predictions per image. Note that this is a hard limit; if images contain many more instances, this leads to incorrect results.
- mAP_IoU_0.10_0.50_0.05_MaxDet_100: the typical COCO mAP metric evaluated at multiple IoU values. The IoU thresholds differ from those of the COCO evaluation to account for the generally lower IoU in 3D data.
- [num]_AP_IoU_0.10_MaxDet_100: AP metric computed per class.
- FROC_score_IoU_0.10: FROC score with the default FPPI values (1/8, 1/4, 1/2, 1, 2, 4, 8). Note (in contrast to the AP implementation): the multi-class case does not compute the metric per class but puts all predictions/ground truth into a single large pool (similar to AP_pool from https://arxiv.org/abs/2102.01066), so inter-class calibration is important here. In most cases, simply averaging the [num]_FROC scores manually to assign the same weight to each class should be preferred; see the sketch below.
- case evaluation (experimental): it is possible to run case evaluations with nnDetection, but this is still experimental, undergoing additional testing, and might change in the future.

=> [val/test]_results contains the metrics reported in the paper
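For a quick look at these numbers, a minimal Python sketch like the one below can load results_boxes.json and average the per-class FROC scores manually. The file path and the exact per-class key names are assumptions here and should be checked against your own results folder.

```python
import json
from pathlib import Path

# Minimal sketch, assuming results_boxes.json is a flat dict mapping the
# metric names listed above to float values (check this against your run).
results_path = Path("val_results/results_boxes.json")  # relative to the model directory
results = json.loads(results_path.read_text())

# Main paper metric
print("AP@IoU=0.1:", results["AP_IoU_0.10_MaxDet_100"])

# Manually macro-average the per-class FROC scores so every class gets the
# same weight (the pooled FROC_score_IoU_0.10 mixes all classes together).
# The per-class key pattern below is an assumption based on the [num]_ prefix.
num_fg_classes = 2
per_class_froc = [results[f"{c}_FROC_score_IoU_0.10"] for c in range(num_fg_classes)]
print("Macro-averaged FROC:", sum(per_class_froc) / len(per_class_froc))
```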

2) Yes, class 0 is the background.

3) The analysis folder is intended as a tool for visual analysis and shouldn't be used for anything else. That is why it contains both an IoU and a score threshold => e.g. looking for false positive predictions requires thresholding the predictions by their score first.

EvelyneCalista commented 1 year ago

Dear @mibaumgartner ,

Thank you for your answer. May I ask why you chose IoU 0.1 as the main metric for the evaluation? I would have thought that using IoU > 0.5 might be more accurate for locating the lesion. Or, in the medical imaging field, is IoU 0.1 already good enough to judge the system's performance?

mibaumgartner commented 1 year ago

Citing from the original Retina U-Net paper (https://arxiv.org/pdf/1811.08661.pdf): "Experiments are evaluated using mean average precision (mAP). We determine mAP at a relatively low matching intersection over union (IoU) threshold of IoU = 0.1. This choice respects the clinical need for coarse localization and also exploits the non-overlapping nature of objects in 3D"

0.5 is the typical value used in the natural image domain (2D): we cannot simply adopt it, since the IoU decreases cubically in 3D, which is much faster than in 2D. An IoU of 0.5 is already really difficult to reach for some 3D detection tasks (it also depends on the object size; datasets with large objects are easier than datasets with tiny objects) and visually already looks extremely good. Given that we are usually interested in diagnostic tasks, we used the coarser value. Other challenges used values in the 0.1-0.3 range.
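To make the cubic effect concrete, here is a small, purely illustrative Python sketch (not part of nnDetection) comparing the IoU of two equally sized boxes shifted by the same relative offset along every axis in 2D and in 3D:

```python
def iou_shifted_unit_box(f: float, dims: int) -> float:
    """IoU of two unit hyper-cubes offset by fraction f along every axis."""
    inter = max(0.0, 1.0 - f) ** dims  # overlap shrinks per axis
    union = 2.0 - inter                # vol(A) + vol(B) - intersection
    return inter / union

for f in (0.1, 0.2, 0.3, 0.5):
    print(f"offset {f:.1f}: 2D IoU = {iou_shifted_unit_box(f, 2):.2f}, "
          f"3D IoU = {iou_shifted_unit_box(f, 3):.2f}")
# offset 0.3: 2D IoU ~ 0.32 vs 3D IoU ~ 0.21 -- the same relative localization
# error yields a noticeably lower IoU in 3D.
```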

EvelyneCalista commented 1 year ago

I am sorry, I am actually a bit confused about the "non-overlapping nature of objects in 3D". Does the "non-overlapping" here refer to the decreasing IoU values of 3D objects? In other words, an IoU of 0.1-0.3 does not mean that the prediction only overlaps a tiny part of the object, but that it still captures the rough area of the lesion. Is my understanding correct?

I also have another question: during training, did you encounter overfitting when training with nnDetection?

Thank you very much, I really appreciate your explanation.

mibaumgartner commented 1 year ago

Regarding the "non-overlapping nature of objects in 3D": in natural images, objects/bounding boxes may overlap each other, so in order to make sure that the algorithm detected the correct object (and didn't mix objects up), a higher IoU threshold is needed. Since objects do not overlap in 3D, even neighbouring objects have small IoU values with respect to each other => i.e. the threshold can be lower while still ensuring that the correct object was predicted.

While 0.1 sounds like "overlap a bit tiny part", the visual assessment of the IoU is actually quite different. Especially for small objects, even small deviations from the ground truth already drop the IoU significantly.
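As a purely illustrative example (again not nnDetection code), consider a small cubic lesion of 5 voxels per side whose predicted box is shifted by a few voxels along each axis:

```python
def iou_small_cube(side: int, shift: int) -> float:
    """IoU of two cubes of the given side length, offset by `shift` voxels per axis."""
    inter = max(0, side - shift) ** 3
    union = 2 * side ** 3 - inter
    return inter / union

for shift in (0, 1, 2, 3):
    print(f"{shift}-voxel shift: IoU = {iou_small_cube(5, shift):.2f}")
# A 1-voxel shift already drops the IoU to ~0.34, 2 voxels to ~0.12 -- far
# below 0.5, even though the prediction still sits on the lesion.
```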

I have seen overfitting of nnDetection on some datasets, especially with limited data and very hard problems (e.g. lesions in prostate MRI are really hard). We still observed quite good results though and decreasing the number of epochs didn't always help with overfitting.

EvelyneCalista commented 1 year ago

Thank you for your explanation, it is really helpful for me. I have tried visualizing the bounding box masks at IoU 0.1 for each slice of the image, and they really do mark the locations I expected.

Do you think the total amount of data also contributes to the overfitting for this task? I am also curious what kinds of techniques can be tried other than decreasing the number of epochs.

mibaumgartner commented 1 year ago

Usually, overfitting is caused by limited dataset size, which is very typical in the medical setting. More high-quality data is always a great way to improve performance, since deep learning models are very data hungry. Nevertheless, we usually try to get as much performance as possible out of the data that is currently available: reducing the number of epochs is a good starting point, and the second thing I usually try is adapting the augmentations to better reflect the dataset (e.g. elastic deformation, more/less rotation, adding resampling artefacts, etc.), as sketched below. Tuning the augmentation is a bit tricky, since there is a balance between the IO and CPU resources. More general techniques to reduce overfitting (e.g. increasing weight decay, dropout) did not work particularly well for me in the past.
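A rough sketch of what an adapted spatial augmentation could look like, using batchgenerators (the augmentation library nnDetection builds on). The parameter values below are made up for illustration only, and how such a transform is wired into nnDetection's own training configuration is not shown here:

```python
import numpy as np
from batchgenerators.transforms.spatial_transforms import SpatialTransform

# Illustrative values only: mild elastic deformation, +/- 15 degree rotations
# and moderate scaling; tune these to reflect the variability of your dataset.
deg15 = 15 / 360 * 2 * np.pi
spatial_aug = SpatialTransform(
    patch_size=(128, 128, 128),
    do_elastic_deform=True, alpha=(0.0, 300.0), sigma=(9.0, 13.0),
    do_rotation=True,
    angle_x=(-deg15, deg15), angle_y=(-deg15, deg15), angle_z=(-deg15, deg15),
    do_scale=True, scale=(0.85, 1.15),
    random_crop=False,
)
```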

github-actions[bot] commented 9 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 8 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.