OCR-D / ocrd_segment

OCR-D-compliant page segmentation
MIT License
66 stars 15 forks source link

Repair fix coords #43

Closed bertsky closed 4 years ago

bertsky commented 4 years ago

This attempts to fix problems caused by invalid polygons from ocrd-segment-repair (both in sanitize and plausibilize mode).

This taught me another lesson about what can go wrong with Shapely / numpy / PAGE interaction. To sum up:

  1. Ensuring valid polygons on the input side (e.g. from OpenCV) is always necessary. The only generic way I can think of is to feed them through simplify with ever increasing tolerance until valid. EDIT2 The problem is that the result of the algorithm implemented in Shapely/GEOS depends on the starting point it picked. In pathological cases, no simplification whatsoever can be achieved. (The only thing that then helps is re-ordering...)
  2. Operations like union or intersection can create collections of shapes. EDIT There are actually 2 cases here:
    1. homogeneous (MultiPolygon) – a discontiguous collection of Polygon – in which case one needs the convex hull.
    2. heterogeneous (GeometryCollection) – a collection of Polygon with Point or LineString – in which case one needs to filter out those shapes which have no intrinsic area (and then check again for the other cases)
  3. Operations like union or intersection can create non-integer points, which when rounded for PAGE serialization can become invalid paths. Unfortunately, Shapely always calculates in floating point internally. So all we can do is rounding and then ensuring validity (as in 1).

Related:

stweil commented 4 years ago

@bertsky, you are faster with pushing new commits to existing pull requests than I am able to test them.

I suggest to wait a little with new commits until the existing pull request was merged and make a new one then (unless there is an urgent need for the new commit of course).

stweil commented 4 years ago

@kba (or whoever has merge rights), can this PR be merged? I'd like to have a working ocrd_all again, and this seems to be one required component.

bertsky commented 4 years ago

@bertsky, you are faster with pushing new commits to existing pull requests than I am able to test them.

I suggest to wait a little with new commits until the existing pull request was merged and make a new one then (unless there is an urgent need for the new commit of course).

Sorry, there are so many related issues at the moment and the shared problem is always our Shapely code. Also, I don't have the time right now to go forward and backward on each one of them. So I felt like rushing it – this time.

@kba (or whoever has merge rights), can this PR be merged? I'd like to have a working ocrd_all again, and this seems to be one required component.

Right now I am still testing myself. Also, after delegating to the PAGE validator in core, I found another bug there...