PRImA-Research-Lab / PAGE-XML

PAGE XML format collection for document image page content and more
Apache License 2.0

add semantics to coordinate system #13

Closed bertsky closed 5 years ago

bertsky commented 5 years ago

Coordinates are at the heart of stand-off annotation formats. In PAGE-XML, all visible elements must have a CoordsType, which must have a @points. There is even some syntax for that enforced by a regular expression. However, the standard lacks any semantics for the coordinate system whatsoever. There is not even a comment about this, so with luck, at least all implementors guessed consistently.

IMO we need to specify that:

  1. @points always describes (a list of x-y pairs of) absolute pixel coordinates ("absolute" meaning they refer to the root image in PageType/@imageFilename with the upper left corner as 0,0)
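
To illustrate the proposed reading, here is a minimal sketch of how a consumer might parse @points into absolute pixel coordinates. The regular expression is only an assumed approximation of the schema's syntax check (non-negative integers, at least two pairs), and `parse_points` / `in_image` are hypothetical helper names, not part of any PAGE library. Whether points lying exactly on the right or bottom image edge count as "inside" is also an assumption here.

```python
import re

# Assumed approximation of the schema's syntax check: at least two
# space-separated "x,y" pairs of non-negative integers.
POINTS_RE = re.compile(r"[0-9]+,[0-9]+( [0-9]+,[0-9]+)+")


def parse_points(points):
    """Parse a @points string into a list of (x, y) integer tuples."""
    if not POINTS_RE.fullmatch(points):
        raise ValueError(f"malformed @points value: {points!r}")
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split(" ")]


def in_image(coords, width, height):
    """Check that all points lie within the root image.

    Coordinates are taken as absolute pixels with (0, 0) at the upper
    left corner. Allowing x == width and y == height is an assumption.
    """
    return all(0 <= x <= width and 0 <= y <= height for x, y in coords)


coords = parse_points("0,0 10,0 10,5")
print(coords, in_image(coords, width=10, height=5))
```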

Moreover, we should clarify whether:

  1. @points has a topology of
    • (unordered) sets of points, or
    • a single (open or closed) path, or
    • multiple closed paths (and if so, whether orientation is relevant as in e.g. left=inside / right=outside)
  2. @points must obey certain constraints like
    • are paths allowed to leave the parent element's polygon outline / bounding box, or maybe even the page's bounding box (i.e. become negative, which is currently forbidden by syntax)? And if not:
    • must they be closed along the parent element's polygon outline / bounding box, or may they stay open when intersecting it?
    • are paths required to be planar (i.e. have no cross-sections)? And if not:
    • how does the content area compute,
      • by union, or
      • by difference, or
      • by orientation (left-of-path or right-of-path)?
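
For what it's worth, the planarity and orientation questions can be checked with a few lines of plain geometry. The following is only a sketch under my own assumptions: the path is a single polygon that is implicitly closed (last point connects back to the first), and the helper names (`is_planar`, `signed_area`) are made up for illustration. It uses a brute-force pairwise edge test, which is fine for typical region outlines but not for very large polygons.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1 or 0."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _in_box(p, q, r):
    """True if r lies within the bounding box of segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """True if segments ab and cd cross or touch."""
    o1, o2 = _orient(a, b, c), _orient(a, b, d)
    o3, o4 = _orient(c, d, a), _orient(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear special cases: an endpoint lies on the other segment.
    return ((o1 == 0 and _in_box(a, b, c)) or (o2 == 0 and _in_box(a, b, d))
            or (o3 == 0 and _in_box(c, d, a)) or (o4 == 0 and _in_box(c, d, b)))


def is_planar(coords):
    """True if no two non-adjacent edges of the closed polygon intersect."""
    n = len(coords)
    if n < 3:
        return False  # not a polygon at all
    edges = [(coords[i], coords[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Adjacent edges share a vertex by construction; skip them.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if segments_intersect(*edges[i], *edges[j]):
                return False
    return True


def signed_area(coords):
    """Shoelace formula. With the y axis pointing down (image coordinates),
    a positive value means the path runs clockwise as displayed."""
    n = len(coords)
    total = sum(coords[i][0] * coords[(i + 1) % n][1]
                - coords[(i + 1) % n][0] * coords[i][1] for i in range(n))
    return total / 2


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]
print(is_planar(square), is_planar(bowtie), signed_area(square))
```

If orientation were made relevant (left=inside / right=outside), the sign of `signed_area` would be the natural thing to validate against.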

This is highly relevant for implementors, especially when polygon processing and AlternativeImage processing on multiple hierarchy levels in the presence of skew become common practice – which is currently happening within OCR-D (for showcases see our Tesseract and our Ocropy preprocessing and segmentation wrappers).

(Cf. altoxml/schema#49)

chris1010010 commented 5 years ago

Good points ;-)

  1. Agreed
    • What would an unordered set of points represent?
    • At the moment, it's intended as single path (closed in case of regions etc. open in case of baseline)
    • In our understanding (although never specified in the format except the non-negative check) the paths should stay inside the page / parent object and they should be non-self-overlapping. Obviously that can't be enforced in XML, but in Aletheia we use higher-level validation to check for such things. If paths self-overlap we convert to a union shape I think. We don't crop polygons if they are outside their parent, but there are tools for that in Aletheia

bertsky commented 5 years ago

* Agreed

Splendid! Would you like me to do a PR?

* What would an unordered set of points represent?

I don't know. It just seemed like the minimal option. As in: "no interpretation is guaranteed, help yourself!" Or as in having no specification at all. Implementors could try to always compute the outer hull, or try their luck with path interpretations...

* At the moment, it's intended as single path (closed in case of regions etc. open in case of baseline)

Ok, fair enough. (Closed by description – at least one pair must repeat – or closed by convention – the first pair is meant to be repeated?)
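
To make the two readings concrete, a small sketch (helper names are hypothetical, and "closed by description" is narrowed here to the case where the last pair repeats the first, which is my assumption):

```python
def is_closed_by_description(coords):
    """True if the path explicitly returns to its first point."""
    return len(coords) > 2 and coords[0] == coords[-1]


def close_by_convention(coords):
    """Return the path as a ring, appending the first point if it is
    not already repeated at the end."""
    ring = list(coords)
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])
    return ring


open_path = [(0, 0), (10, 0), (10, 5)]
print(is_closed_by_description(open_path))
print(close_by_convention(open_path))
```

Under the convention reading, a consumer would always normalize with something like `close_by_convention` before handing a region outline to a geometry library; a baseline would be left open.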

But what about cases where the region is non-contiguous, because e.g. a TextRegion gets flowed over by an ImageRegion, or a TextLine by a GraphicRegion? In that case, only having a single path necessitates including the intruders, so the only way to get rid of them for further processing (layout / dewarping / recognition) would be to offer an AlternativeImage where they get clipped to white. See here for example images on this approach.

* In our understanding (although never specified in the format except the non-negative check) the paths should stay inside the page / parent object and they should be non-self-overlapping.

I was hoping you would say so. (But I do get non-planar polygons from Tesseract sometimes, and some contour libraries never bother to close their paths.)

* Obviously that can't be enforced in XML, but in Aletheia we use higher-level validation to check for such things.

Yes, as with most of the semantics, this would be a matter of some non-XSD validation. In OCR-D, we are planning to write one using geometry heuristics. Is there some place I can look at the respective rules in Aletheia?

* If paths self-overlap we convert to a union shape I think.

That is also what Tesseract does itself (if asked to return a raw image of blocks from layout analysis). But if this is forbidden by the schema, it's totally up to the processor/library trying to produce PAGE to handle this case if it does arise internally. (Maybe it's still worth commenting on in the schema, though.)

chris1010010 commented 5 years ago

Closed by convention (first pair repeats). Yes, "intruders" are a problem, but simplicity was favored over being able to cover all use cases (this was a decision by the creators). PAGE was never intended for pixel-accurate description. There's a list in the Aletheia user guide (page 118).

bertsky commented 5 years ago

Thanks a lot for all the clarification! I hope the PR meets your approval.