evaluate: explain/document metrics

Yes, that paper lent the idea for the oversegmentation and undersegmentation measures, but only for these two (not the others), and I took the liberty of deviating from the exact definition in Zhang et al. 2021: https://github.com/OCR-D/ocrd_segment/blob/81923495648c346a84436fb7d08727d9c13eb88d/ocrd_segment/evaluate.py#L440-L444

So in my implementation these measures are merely raw ratios, i.e. the share of regions in the ground truth (GT) and in the detections (DT) which have been oversegmented (or undersegmented, respectively).

My notion of a match is somewhat arbitrary, but IMO more adequate than averaging over different IoU thresholds for various confidence thresholds:

A pair of a true and a predicted region is a match, i.e. a true positive (TP), iff
- its IoU is ≥ 50% or
- its IoGT is ≥ 50% or
- its IoDT is ≥ 50%.

A prediction which is not matched is a false positive (FP).
A ground truth region which is not matched is a false negative (FN).

(All area values under consideration are numbers of pixels in the polygon-masked segments, not just bounding box sizes.)
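To make the matching rule concrete, here is a minimal sketch in plain Python. It is not the code from `evaluate.py`: regions are represented as sets of `(x, y)` pixel coordinates standing in for polygon masks, and the names `overlap_ratios`, `is_match` and `rectangle` are hypothetical helpers made up for this illustration.

```python
def overlap_ratios(gt_pixels, dt_pixels):
    """Return (IoU, IoGT, IoDT) for two regions given as sets of pixels."""
    inter = len(gt_pixels & dt_pixels)
    union = len(gt_pixels | dt_pixels)
    iou = inter / union if union else 0.0
    iogt = inter / len(gt_pixels) if gt_pixels else 0.0
    iodt = inter / len(dt_pixels) if dt_pixels else 0.0
    return iou, iogt, iodt


def is_match(gt_pixels, dt_pixels, threshold=0.5):
    """A pair counts as a match if any of the three ratios reaches the threshold."""
    return any(r >= threshold for r in overlap_ratios(gt_pixels, dt_pixels))


def rectangle(x0, y0, x1, y1):
    """Hypothetical helper: pixel set of an axis-aligned rectangle (end-exclusive)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


gt = rectangle(0, 0, 10, 10)  # true region, 100 px
dt = rectangle(0, 0, 10, 4)   # predicted region, 40 px, fully inside gt

print(overlap_ratios(gt, dt))  # (0.4, 0.4, 1.0)
print(is_match(gt, dt))        # True, because IoDT is at least 50%
```

Since the union is never smaller than either region, IoU cannot exceed IoGT or IoDT, so in this sketch the rule is equivalent to requiring that the larger of IoGT and IoDT reaches 50%. That is what lets a detection covering only a part of a true region (or vice versa) still count as a match.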

So in all, you get the following metrics here:

area measures
- IoU: intersection over union, i.e. the share of the overlapping area of a match over the union of the true and the predicted region
- IoGT: intersection over ground truth, i.e. the share of the overlapping area of a match over the total area of the true region
- IoDT: intersection over detection, i.e. the share of the overlapping area of a match over the total area of the predicted region
- pixel-recall: page-wise aggregate of intersection over GT including missed true regions (FN), i.e. the share of the overlapping areas over the total area of true regions in a page
- pixel-precision: page-wise aggregate of intersection over DT including fake predicted regions (FP), i.e. the share of the overlapping areas over the total area of predicted regions in a page
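
As an illustration of the two page-wise pixel measures, the following sketch sums the overlap of all matched pairs and divides by the total true or predicted area of the page. This is my reading of the definitions above and not the code from `evaluate.py`; the function `pixel_scores`, the helper `rectangle` and the representation of regions as pixel sets are assumptions made for the example.

```python
def rectangle(x0, y0, x1, y1):
    """Hypothetical helper: pixel set of an axis-aligned rectangle (end-exclusive)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def pixel_scores(gt_regions, dt_regions, matches):
    """Page-wise pixel-recall and pixel-precision.

    gt_regions, dt_regions: lists of pixel sets for one page
    matches: list of (gt_index, dt_index) pairs regarded as matches
    """
    overlap = sum(len(gt_regions[g] & dt_regions[d]) for g, d in matches)
    gt_area = sum(len(r) for r in gt_regions)  # includes missed regions (FN)
    dt_area = sum(len(r) for r in dt_regions)  # includes spurious regions (FP)
    recall = overlap / gt_area if gt_area else 0.0
    precision = overlap / dt_area if dt_area else 0.0
    return recall, precision


gt_regions = [rectangle(0, 0, 10, 10), rectangle(20, 0, 30, 10)]  # second one is missed
dt_regions = [rectangle(0, 0, 10, 5), rectangle(40, 0, 50, 5)]    # second one is spurious
matches = [(0, 0)]

print(pixel_scores(gt_regions, dt_regions, matches))  # (0.25, 0.5)
```

The unmatched regions contribute nothing to the numerator but do enlarge the denominators, which is how FN and FP pull the page-wise values down.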

segment measures
- oversegmentation: share of true and predicted regions which have been oversegmented (i.e. where true regions match multiple detections) over all regions
- undersegmentation: share of true and predicted regions which have been undersegmented (i.e. where predicted regions match multiple ground truths) over all regions
- recall: ratio of matches (TP) over true regions, i.e. share of correctly predicted regions in total GT
- precision: ratio of matches (TP) over detected regions, i.e. share of correctly predicted regions in total DT
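
The segment measures can be sketched from the list of matched pairs alone. This is a simplified illustration, not the code from `evaluate.py`: the function `segment_scores` is hypothetical, it assumes each (GT, DT) pair appears at most once, and it counts matched regions (rather than matched pairs) for recall and precision so that both stay within 0 to 1. For over- and undersegmentation it only returns which regions are affected; how these are normalised into ratios is defined in the source lines linked above and is not reproduced here.

```python
from collections import Counter


def segment_scores(n_gt, n_dt, matches):
    """Region-level scores for one page.

    n_gt, n_dt: number of true and predicted regions
    matches: list of unique (gt_index, dt_index) pairs regarded as matches
    """
    gt_hits = Counter(g for g, _ in matches)  # number of detections per true region
    dt_hits = Counter(d for _, d in matches)  # number of true regions per detection
    return {
        "FN": n_gt - len(gt_hits),  # true regions without any match
        "FP": n_dt - len(dt_hits),  # detections without any match
        "recall": len(gt_hits) / n_gt if n_gt else 0.0,
        "precision": len(dt_hits) / n_dt if n_dt else 0.0,
        # true regions split across several detections
        "oversegmented_gt": sorted(g for g, n in gt_hits.items() if n > 1),
        # detections merging several true regions
        "undersegmented_dt": sorted(d for d, n in dt_hits.items() if n > 1),
    }


# GT 0 is split into DT 0 and DT 1; DT 2 merges GT 1 and GT 2;
# GT 3 is missed; DT 3 and DT 4 are spurious.
scores = segment_scores(n_gt=4, n_dt=5, matches=[(0, 0), (0, 1), (1, 2), (2, 2)])
print(scores["recall"], scores["precision"])  # 0.75 0.6
print(scores["oversegmented_gt"], scores["undersegmented_dt"])  # [0] [2]
```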

For each metric, there is a page-wise (or even segment-wise) and an aggregated measure; the latter always uses micro-averaging over all (matching pairs in all) pages.
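
To show what micro-averaging means for the aggregate, here is a small sketch comparing it with macro-averaging. The function names and the example counts are made up for illustration; each page is represented by a (numerator, denominator) pair, e.g. matched true regions and total true regions for recall.

```python
def micro_average(pages):
    """Sum numerators and denominators over all pages, then divide once."""
    numerator = sum(n for n, _ in pages)
    denominator = sum(d for _, d in pages)
    return numerator / denominator if denominator else 0.0


def macro_average(pages):
    """For contrast: mean of the per-page ratios, each page weighted equally."""
    ratios = [n / d for n, d in pages if d]
    return sum(ratios) / len(ratios) if ratios else 0.0


# (matched true regions, total true regions) per page, hypothetical numbers
pages = [(9, 10), (1, 2)]

print(round(micro_average(pages), 4))  # 0.8333
print(round(macro_average(pages), 4))  # 0.7
```

With micro-averaging, a page with many regions weighs more than a page with few, so a single sparse page with a poor score does not dominate the aggregate.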
