OCR-D / ocrd_segment

OCR-D-compliant page segmentation
MIT License

repair: inaccurate coordinates (tiny inconsistency/invalidity) #32

Closed bertsky closed 2 years ago

bertsky commented 4 years ago

From discussion on OCR-D/core#418:

Additionally, IMO the coordinate checks should be made a little less strict (and thus more compatible with Aletheia) to avoid crying wolf.

Things I see frequently:

  1. very small (up to 1 pixel) violations of containment in the parent element
    • Shapely does not have almost_within, but one could try containment within the dilated version:
      if not (child_poly.within(node_poly) or
              child_poly.within(node_poly.buffer(0.5))):
          ...  # only now report the containment violation
  2. tiny self-intersections between directly neighbouring points, where the outline goes back and forth (probably caused by internal rounding)
    • This must be repaired on the spot, otherwise Shapely will not operate on these polygons. Possibly:
      if not node_poly.is_valid:
          simplified = node_poly.simplify(0.8)  # compute once, reuse below
          if simplified.is_valid:
              node_poly = simplified
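
The two ideas above can be put together in a small, self-contained sketch. This is only an illustration, not what the validator or ocrd-segment-repair actually does: the helper names `almost_within` and `repair_if_trivial` are hypothetical, the tolerances 0.5 and 0.8 are taken from the snippets above, and the sample coordinates are made up.

```python
from shapely.geometry import Polygon


def almost_within(child, parent, tolerance=0.5):
    """Hypothetical helper: containment test that tolerates sub-pixel overshoot.

    Tries strict containment first, then containment in the parent
    dilated (buffered) by `tolerance` pixels.
    """
    return bool(child.within(parent) or child.within(parent.buffer(tolerance)))


def repair_if_trivial(poly, tolerance=0.8):
    """Hypothetical helper: return a simplified polygon if that makes it valid.

    Valid input is returned untouched. If simplification does not yield
    a valid, non-empty polygon, the input is returned unchanged so the
    caller can still report it as an error.
    """
    if poly.is_valid:
        return poly
    simplified = poly.simplify(tolerance)
    if simplified.is_valid and not simplified.is_empty:
        return simplified
    return poly


# Made-up sample data: a parent region and two children.
parent = Polygon([(0, 0), (100, 0), (100, 50), (0, 50)])
# Overshoots the parent's right edge by 0.4 px: tolerated.
child = Polygon([(10, 10), (100.4, 10), (100.4, 40), (10, 40)])
# Overshoots by 2 px: still a violation.
far_child = Polygon([(10, 10), (102, 10), (102, 40), (10, 40)])
# Outline that steps back and forth along its top edge.
spike = Polygon([(0, 0), (10, 0), (10, 10), (5, 10), (6, 10), (0, 10)])

fixed = repair_if_trivial(spike)
```

Returning the unrepaired polygon when simplification fails keeps the strict check meaningful: only trivial defects are silently fixed, everything else still surfaces.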

But it could be more prudent to keep a strict validator, and outsource these repairs into a dedicated Aletheia postprocessor (e.g. ocrd-segment-repair with a new correct-coords=true).

Originally posted by @bertsky in https://github.com/OCR-D/core/pull/418#issuecomment-576920368

bertsky commented 4 years ago

Status update: since https://github.com/OCR-D/core/pull/442/commits/6bf98d0350a6fa383fe5ab6592f470f917748c0b we do have a slightly more tolerant validator, but this is not much help, because

bertsky commented 2 years ago

This was solved long ago: ocrd-segment-repair does fix (trivial) validation errors automatically.