Closed strasserpatrick closed 1 week ago
As you correctly stated, you can recompute the classes from the regions and calculate the corresponding metrics that way. If you do this on BraTS, it is to be expected that your Dice scores will be lower. The overall volume of CE is much smaller than CE + necrosis, and the necrosis is likely much easier to identify as well; hence DSC will be better for CE + necrosis than for CE alone. The same goes for the other regions. The larger the volume, the easier it generally gets, because internal volume is trivial to predict (hence liver DSC is usually also super high, despite the boundaries not necessarily being great).
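A minimal sketch (my own illustration, not from nnU-Net) of the metric property described above: for the same one-voxel boundary error, a larger structure gets a higher Dice, because the trivially-correct interior dominates the overlap.

```python
import numpy as np

def dice(a, b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def cube(size, radius):
    """Binary cube of side 2*radius centered in a size^3 volume."""
    vol = np.zeros((size, size, size), dtype=bool)
    c = size // 2
    vol[c - radius:c + radius, c - radius:c + radius, c - radius:c + radius] = True
    return vol

# Same 1-voxel over-segmentation error on a small vs a large structure:
gt_small, pred_small = cube(32, 4), cube(32, 5)
gt_large, pred_large = cube(32, 12), cube(32, 13)
print(dice(gt_small, pred_small))  # ~0.68
print(dice(gt_large, pred_large))  # ~0.88
```

The boundary error is identical in both cases; only the interior volume differs, and that alone moves the Dice by ~0.2.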
Long story short: the DSC reduction is expected due to the properties of the metric. If you don't mess up the re-conversion, you can trust these metrics, and a simple visual inspection of your re-computed segmentations should be sufficient to verify that you did not mess up.
Okay, thank you for the clarification. As I have now seen, the validation and test sets are evaluated on the regions, so no conversion is necessary for comparing to the literature and reproducing results.
Hey @FabianIsensee
I have a quick understanding question about how to evaluate and compare the (Dice) scores obtained from region-based training with related work that trains on normal labels.
Example on BraTS:
Regions: (1, 2, 3), (2, 3), (3)
Labels: 1, 2, 3
As also stated in the docs for region-based training, I get the validation metrics for each region. The literature normally uses (or reports) per-label metrics.
Can I simply compare the region metrics against the label metrics, e.g. by mapping
3 -> (3)
2 -> (2, 3)
1 -> (1, 2, 3)
or do I have to re-calculate the metrics from the labels and the segmentation mask? I tried the latter, which yields very bad per-label Dice scores compared to the region scores.
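For reference, here is a sketch of the re-calculation I mean, assuming the regions nest as (1, 2, 3) ⊇ (2, 3) ⊇ (3); the function names are my own illustration, not the nnU-Net API.

```python
import numpy as np

def regions_to_labels(r123, r23, r3):
    """Convert nested boolean region masks back to exclusive labels 1/2/3."""
    labels = np.zeros(r123.shape, dtype=np.uint8)
    labels[r123] = 1  # voxels only in the outermost region keep label 1
    labels[r23] = 2   # overwritten where the middle region applies
    labels[r3] = 3    # innermost region wins
    return labels

def dice_per_label(pred, gt, label_ids=(1, 2, 3)):
    """Per-label Dice between two exclusive label maps."""
    scores = {}
    for l in label_ids:
        p, g = pred == l, gt == l
        denom = p.sum() + g.sum()
        scores[l] = 2.0 * (p & g).sum() / denom if denom else 1.0
    return scores
```

After converting both the prediction and the ground truth this way, the per-label Dice can be compared against label-based numbers from the literature.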
Maybe you could clarify this for me :)