Open bertsky opened 3 years ago
IDNEXT is only for region-level ReadingOrder, I guess (which you already have in the TODO).
It is also (trivially) implemented:
Oh, right! So you can already set the check in the Readme, no?
Just curious: why depth=1 and not full recursion here and in convert_text? You can still make the hierarchy flat on the ALTO side, but not traversing recursively on the PAGE side will lose information. (And I would recommend being recursive on both sides BTW.)
why depth=1 and not full recursion here
Copy-Pasta, the reading order conversion should indeed be recursive.
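To illustrate what a fully recursive reading order conversion could look like, here is a minimal sketch. It assumes the PAGE ReadingOrder element names (OrderedGroup, UnorderedGroup, their *Indexed variants, RegionRef, RegionRefIndexed); the function name flatten_reading_order and the sample document are made up for this example and are not taken from the converter's code. Namespaces are left out of the sample for brevity, and the helper strips them if present.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit('}', 1)[-1]


def flatten_reading_order(group):
    """Return all referenced region IDs below `group`, at any nesting depth."""
    children = list(group)
    if local(group.tag).startswith('OrderedGroup'):
        # Children of ordered groups carry @index; sort by it (stable sort).
        children.sort(key=lambda el: int(el.get('index', 0)))
    ids = []
    for child in children:
        name = local(child.tag)
        if name in ('RegionRef', 'RegionRefIndexed'):
            ids.append(child.get('regionRef'))
        elif 'Group' in name:
            # Recurse instead of stopping at depth=1.
            ids.extend(flatten_reading_order(child))
    return ids


# Hypothetical sample, not taken from a real document.
SAMPLE = """
<ReadingOrder>
  <OrderedGroup id="ro">
    <RegionRefIndexed index="1" regionRef="r2"/>
    <RegionRefIndexed index="0" regionRef="r1"/>
    <OrderedGroupIndexed index="2" id="g1">
      <RegionRefIndexed index="0" regionRef="r3"/>
      <UnorderedGroupIndexed index="1" id="g2">
        <RegionRef regionRef="r4"/>
      </UnorderedGroupIndexed>
    </OrderedGroupIndexed>
  </OrderedGroup>
</ReadingOrder>
""".strip()

order = flatten_reading_order(ET.fromstring(SAMPLE))
print(order)  # ['r1', 'r2', 'r3', 'r4']
```

Members of unordered groups are emitted in document order here, which is only one possible choice.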
and in convert_text
My idea was that convert_text should have an outer non-recursive loop and then have a region/block-specific inner loop. What I want to avoid is breaking table regions. What regions other than TableRegion can typically contain recursive regions?
All regions can embed all other region types. As for typical cases, I don't really know. I guess that besides the pattern table→text, which is mandatory, the obvious text→text should be pretty pervasive due to cases like block→(heading|paragraph) or block→(drop-capital|paragraph) or block→(list-label|paragraph). Then there's of course image/graphics→text due to the caption relation. But one could think of many combinations, depending on the complexity of the layout and the necessity of representation...
I think we should try to be as general and agnostic as is possible.
On the ALTO side, one may express regions-in-regions as ComposedBlock elements. These are a subtype of the ALTO block type, like TextBlock, Illustration or GraphicalElement (layout elements of any sort). A ComposedBlock may be annotated with @TYPE to show whether it represents a table, a column, an advertisement or any other user-defined text class.
This is an important use case for announcement/advertisement newspaper pages.
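A rough sketch of how nested PAGE regions could be carried over to ALTO without flattening, assuming the ALTO schema's block element names (ComposedBlock, TextBlock, Illustration, GraphicalElement). The function convert_region, the mapping table and the @TYPE values derived from the region name are hypothetical choices for this example, not the converter's actual behaviour. The sketch only shows the nesting; text lines, coordinates and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

# Assumed mapping for regions without child regions (illustrative only).
LEAF_BLOCKS = {
    'TextRegion': 'TextBlock',
    'ImageRegion': 'Illustration',
    'GraphicRegion': 'GraphicalElement',
}


def convert_region(region, parent):
    """Append an ALTO block for `region` under `parent`, recursing into child regions."""
    kind = region.tag
    children = [el for el in region if el.tag.endswith('Region')]
    if children:
        # Any region that embeds other regions becomes a ComposedBlock.
        # The TYPE value ("text", "image", "table", ...) is a made-up convention.
        block = ET.SubElement(parent, 'ComposedBlock',
                              ID=region.get('id'),
                              TYPE=kind[:-len('Region')].lower())
        for child in children:
            convert_region(child, block)
    else:
        # Unknown leaf region types fall back to an empty ComposedBlock.
        block = ET.SubElement(parent, LEAF_BLOCKS.get(kind, 'ComposedBlock'),
                              ID=region.get('id'))
    return block


# Hypothetical input covering text→text, image→text and table→text.
PAGE = """
<Page>
  <TextRegion id="r1">
    <TextRegion id="r1_heading"/>
    <TextRegion id="r1_para"/>
  </TextRegion>
  <ImageRegion id="r2">
    <TextRegion id="r2_caption"/>
  </ImageRegion>
  <TableRegion id="r3">
    <TextRegion id="r3_cell"/>
  </TableRegion>
</Page>
""".strip()

print_space = ET.Element('PrintSpace')
for region in ET.fromstring(PAGE):
    convert_region(region, print_space)

print(ET.tostring(print_space, encoding='unicode'))
```

Because the same function handles every region type, table regions are not treated differently from text or image regions that embed other regions. A real conversion would also have to decide where a parent region's own lines or image content go inside the ComposedBlock.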
BTW, on the line level, besides TextRegion/@textLineOrder we'd have to adhere to TextLine/@index ordering – both should influence the ALTO TextLine element ordering (but I don't know how the two attributes relate).
Perhaps we should simply make this configurable as in #27 for regions: --textline-order [document|index|textline-order] (but I wouldn't know how to implement the last option, due to the aforementioned ambiguity in the semantics of that attribute).
Not sure if there's any equivalent for that in ALTO. Glyphs are supposed to be ordered by XML order. Spec does not say anything about words and lines though.
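For the proposed --textline-order choices, a minimal sketch of the line selection could look like the following. The function order_lines and the handling of lines without @index (sorted last) are assumptions made for this example. The textline-order mode is deliberately left unimplemented, since the semantics of that attribute are unclear in this discussion.

```python
import xml.etree.ElementTree as ET


def order_lines(region, mode='document'):
    """Return the TextLine children of `region` in the requested order."""
    lines = region.findall('TextLine')
    if mode == 'document':
        # Plain XML order.
        return lines
    if mode == 'index':
        # Sort by TextLine/@index; lines lacking @index go last and keep
        # their document order (sorted() is stable). This is an assumption.
        return sorted(
            lines,
            key=lambda line: (line.get('index') is None,
                              int(line.get('index') or 0)))
    if mode == 'textline-order':
        raise NotImplementedError(
            'semantics of TextRegion/@textLineOrder not settled')
    raise ValueError('unknown mode: %s' % mode)


# Hypothetical region, without namespace for brevity.
region = ET.fromstring(
    '<TextRegion id="r1">'
    '<TextLine id="l1" index="2"/>'
    '<TextLine id="l2" index="0"/>'
    '<TextLine id="l3"/>'
    '<TextLine id="l4" index="1"/>'
    '</TextRegion>')

print([l.get('id') for l in order_lines(region, 'document')])  # ['l1', 'l2', 'l3', 'l4']
print([l.get('id') for l in order_lines(region, 'index')])     # ['l2', 'l4', 'l1', 'l3']
```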
IDNEXT is only for region-level ReadingOrder, I guess (which you already have in the TODO).
But see https://github.com/PRImA-Research-Lab/PAGE-XML/issues/26 for the correct interpretation on the PAGE side.