OCR-D / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

readingDirection and textLineOrder #2

Open bertsky opened 3 years ago

bertsky commented 3 years ago

Not sure if there's any equivalent for that in ALTO. Glyphs are supposed to be ordered by XML order. Spec does not say anything about words and lines though. IDNEXT is only for region-level ReadingOrder I guess (which you already have in the TODO).

But see https://github.com/PRImA-Research-Lab/PAGE-XML/issues/26 for correct interpretation on the PAGE side.

kba commented 3 years ago

> IDNEXT is only for region-level ReadingOrder I guess (which you already have in the TODO).

It is also (trivially) implemented:

https://github.com/kba/page-to-alto/blob/46a8cc2fb74ce327e9d195f1095699cbae946cce/ocrd_page_to_alto/convert.py#L75-L78

bertsky commented 3 years ago

>> IDNEXT is only for region-level ReadingOrder I guess (which you already have in the TODO).

> It is also (trivially) implemented:

Oh, right! So you can already set the check in the Readme, no?

Just curious: why depth=1 and not full recursion here and in convert_text? You can still make the hierarchy flat on the ALTO side, but not traversing recursively on the PAGE side will lose information. (And I would recommend being recursive on both sides BTW.)

kba commented 3 years ago

> why depth=1 and not full recursion here

Copy-Pasta, the reading order conversion should indeed be recursive.

> and in convert_text

My idea was that convert_text should have an outer non-recursive loop and then have a region/block-specific inner loop. What I want to avoid is breaking table regions. What regions other than TableRegion can typically contain recursive regions?
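To illustrate the recursive reading-order conversion discussed above, here is a minimal sketch. It is not the code in convert.py. It assumes a PAGE 2019 ReadingOrder built from OrderedGroup/UnorderedGroup elements, and the function name flatten_reading_order and the sample region IDs are made up for this example. It flattens the nested groups into one sequence and derives an IDNEXT-style mapping from it.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: a nested group inside the top-level ordered group.
PAGE_SNIPPET = """
<ReadingOrder xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <OrderedGroup id="ro">
    <RegionRefIndexed index="1" regionRef="r2"/>
    <RegionRefIndexed index="0" regionRef="r1"/>
    <OrderedGroupIndexed id="g1" index="2">
      <RegionRefIndexed index="0" regionRef="r3"/>
      <RegionRefIndexed index="1" regionRef="r4"/>
    </OrderedGroupIndexed>
  </OrderedGroup>
</ReadingOrder>
"""


def local_name(elem):
    """Return the tag name without its namespace part."""
    return elem.tag.rsplit("}", 1)[-1]


def flatten_reading_order(group):
    """Recursively collect regionRef values from a (possibly nested) group."""
    children = list(group)
    if local_name(group).startswith("OrderedGroup"):
        # Ordered groups are sorted by @index; unordered groups keep XML order.
        children.sort(key=lambda c: int(c.get("index", 0)))
    refs = []
    for child in children:
        name = local_name(child)
        if name.startswith("RegionRef"):
            refs.append(child.get("regionRef"))
        elif "Group" in name:
            refs.extend(flatten_reading_order(child))
    return refs


root = ET.fromstring(PAGE_SNIPPET)
order = flatten_reading_order(root.find("pc:OrderedGroup", NS))
# Each region points to its successor, as ALTO's IDNEXT would.
id_next = dict(zip(order, order[1:]))
print(order)
print(id_next)
```

Flattening is only one option: the group structure is lost on the ALTO side, which may or may not be acceptable depending on how the output is consumed.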

bertsky commented 3 years ago

> What regions other than TableRegion can typically contain recursive regions?

All regions can embed all other region types. As for typical cases, I don't know really. I guess that besides the pattern table→text, which is mandatory, the obvious text→text should be pretty pervasive due to cases like block→(heading|paragraph) or block→(drop-capital|paragraph) or block→(list-label|paragraph). Then there's of course image/graphics→text due to the caption relation. But one could think of many combinations, depending on the complexity of the layout and necessity of representation...

I think we should try to be as general and agnostic as is possible.
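A type-agnostic traversal along these lines could look like the following sketch. It assumes that every PAGE element whose name ends in "Region" is a region that may itself contain regions. The helper iter_regions and the sample page are hypothetical and not part of page-to-alto.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample covering text→text, table→text and image→text nesting.
PAGE_SNIPPET = """
<Page xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
      imageFilename="p.png" imageWidth="100" imageHeight="100">
  <TextRegion id="block">
    <TextRegion id="heading" type="heading"/>
    <TextRegion id="para" type="paragraph"/>
  </TextRegion>
  <TableRegion id="table">
    <TextRegion id="cell"/>
  </TableRegion>
  <ImageRegion id="img">
    <TextRegion id="caption" type="caption"/>
  </ImageRegion>
</Page>
"""


def local_name(elem):
    """Return the tag name without its namespace part."""
    return elem.tag.rsplit("}", 1)[-1]


def iter_regions(parent, depth=0):
    """Yield (depth, region type, id) for all regions below parent, depth-first."""
    for child in parent:
        name = local_name(child)
        if name.endswith("Region"):
            yield depth, name, child.get("id")
            # Recurse regardless of the region type, so nothing is skipped.
            yield from iter_regions(child, depth + 1)


page = ET.fromstring(PAGE_SNIPPET)
regions = list(iter_regions(page))
for depth, name, region_id in regions:
    print("  " * depth + f"{name} {region_id}")
```

Because the recursion does not look at the region type, a type-specific inner loop (for example for tables) can still be layered on top by branching on the yielded type name.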

M3ssman commented 3 years ago

On the ALTO side, one may express regions-in-regions as ComposedBlock elements. These are a subtype of the ALTO block type, like TextBlock, Illustration or GraphicalElement (layout elements of any sort). A ComposedBlock might be annotated with @TYPE to show whether it represents a table, a column, an advertisement or any other user-defined text class. This is an important use case for announcement/advertisement newspaper pages.
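As a rough sketch of that mapping, the snippet below builds nested ALTO blocks from a nested description of regions. The input dict shape (keys id, type, children) and the function add_block are assumptions made for this example. It also simplifies by turning every leaf into a TextBlock, whereas a real converter would pick Illustration or GraphicalElement where appropriate.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
ET.register_namespace("", ALTO_NS)


def add_block(parent, region):
    """Append region to parent: ComposedBlock if it has children, else TextBlock."""
    children = region.get("children", [])
    tag = "ComposedBlock" if children else "TextBlock"
    block = ET.SubElement(parent, f"{{{ALTO_NS}}}{tag}", ID=region["id"])
    # TYPE is only set on ComposedBlock here; leaf typing is out of scope.
    if children and "type" in region:
        block.set("TYPE", region["type"])
    for child in children:
        add_block(block, child)
    return block


# Hypothetical input: a table region containing two cell regions.
table = {
    "id": "table1",
    "type": "table",
    "children": [{"id": "cell1"}, {"id": "cell2"}],
}

print_space = ET.Element(f"{{{ALTO_NS}}}PrintSpace")
composed = add_block(print_space, table)
print(ET.tostring(print_space, encoding="unicode"))
```

Since add_block calls itself for each child, deeper nesting (a ComposedBlock inside a ComposedBlock) falls out of the same code path.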

bertsky commented 2 years ago

BTW, on the line level, besides TextRegion/@textLineOrder we'd have to adhere to TextLine/@index ordering – both should influence the ALTO TextLine element ordering (but I don't know how the two attributes relate).

bertsky commented 2 years ago

> BTW, on the line level, besides TextRegion/@textLineOrder we'd have to adhere to TextLine/@index ordering – both should influence the ALTO TextLine element ordering (but I don't know how the two attributes relate).

Perhaps we should simply make this configurable as in #27 for regions: --textline-order [document|index|textline-order] (but I wouldn't know how to implement the latter option, due to the aforementioned ambiguity in the semantics of that attribute).
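A minimal sketch of how the first two proposed modes could behave is shown below. The function order_lines is hypothetical, and the option is only a proposal in this thread, not an existing flag. The textline-order mode is deliberately left unimplemented because its semantics are unclear, as noted above.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: document order and @index order disagree.
PAGE_SNIPPET = """
<TextRegion xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
            id="r1" textLineOrder="top-to-bottom">
  <TextLine id="l1" index="2"/>
  <TextLine id="l2" index="0"/>
  <TextLine id="l3" index="1"/>
</TextRegion>
"""


def order_lines(region, mode="document"):
    """Return the TextLine children of region in the requested order."""
    lines = region.findall("pc:TextLine", NS)
    if mode == "document":
        # Keep the order in which the lines appear in the XML.
        return lines
    if mode == "index":
        # Sort by @index; lines without @index go last, in document order
        # (sorted() is stable, so ties keep their original sequence).
        return sorted(
            lines,
            key=lambda line: (line.get("index") is None, int(line.get("index", 0))),
        )
    if mode == "textline-order":
        raise NotImplementedError("semantics of @textLineOrder are ambiguous")
    raise ValueError(f"unknown mode: {mode}")


region = ET.fromstring(PAGE_SNIPPET)
by_document = [line.get("id") for line in order_lines(region, "document")]
by_index = [line.get("id") for line in order_lines(region, "index")]
print(by_document)
print(by_index)
```

Putting unindexed lines last is an arbitrary choice made for this sketch; rejecting such mixed input would be an equally defensible policy.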