kba closed 4 years ago
I concur, we should use the next opportunity to use the term block instead of region more consistently. (Our METS file group USE classes already use BLOCK, but we are already discussing relaxing that scheme.)
Incidentally, this hierarchy is also identical with Tesseract's RIL
(ResultIterator levels).
But I don't think we are entirely incompatible with a paragraph level either. PAGE-XML defines all region types fully recursively, and designates @type="paragraph" for paragraphs. IIUC the current OCR-D spec and implementation are agnostic about whether regions should be used in a flat or multi-level fashion. In some places though, PAGE-XML already requires at least 2 levels (namely table cells and footnotes, perhaps also text blocks comprising a @type="drop-capital" and a @type="paragraph" region). These are also the very places we have not tackled at all with the current toolset yet. Our GT, however, mostly already uses 2 levels for that, and rightly so, because this is the most versatile option: it can still be reduced to a flat regime, but can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues.
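To illustrate the point about reducing a 2-level representation to a flat one, here is a minimal sketch using only the Python standard library. The sample document, its region IDs and the helper names `walk_regions` and `leaf_region_ids` are made up for illustration; they are not part of the OCR-D spec or of any existing processor, and the sample omits elements (such as Coords) that a schema-valid PAGE-XML file would need.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 PAGE schema; other schema versions use another date.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample (not schema-valid, Coords etc. omitted): one block-level
# region holding a drop-capital and a paragraph, plus one flat paragraph region.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="p1.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextRegion id="r1_1" type="drop-capital"/>
      <TextRegion id="r1_2" type="paragraph"/>
    </TextRegion>
    <TextRegion id="r2" type="paragraph"/>
  </Page>
</PcGts>
"""


def walk_regions(parent, depth=1):
    """Yield (depth, id, type) for every TextRegion below parent, depth-first."""
    for region in parent.findall("pc:TextRegion", NS):
        yield depth, region.get("id"), region.get("type")
        yield from walk_regions(region, depth + 1)


def leaf_region_ids(parent):
    """Flat view: IDs of the regions that contain no further TextRegion."""
    ids = []
    for region in parent.findall("pc:TextRegion", NS):
        nested = leaf_region_ids(region)
        # A region with children is replaced by its leaves; otherwise keep it.
        ids.extend(nested if nested else [region.get("id")])
    return ids


page = ET.fromstring(SAMPLE).find("pc:Page", NS)
for depth, region_id, region_type in walk_regions(page):
    print("  " * (depth - 1) + f"{region_id} type={region_type}")
print(leaf_region_ids(page))
```

The nested form keeps the block (r1) available for consumers that want it, while the flat view is derived on demand; going the other way (recovering blocks from a flat list) would need extra information.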
So I would propose also taking that opportunity to decide in favour of a mildly recursive region representation of 2 levels, with both a block level and an explicit paragraph level in the functional model. This would allow e.g. ocrd-tesserocr-segment to operate on 3 distinct output levels:
Okay, so there was consensus in the VC that:
region is the better term than block, because
With b199c62b314b9b4b8279811029bf7fc2c47b87da, can we release this next week?
Yes and also merge https://github.com/OCR-D/assets/pull/73
Released and assets adapted.
ALTO tries to be interoperable with IIIF as discussed here. There is a "Text Granularity Extension" for IIIF that defines what we call "levels":
Seems reasonably compatible with our definitions, though we call line TextLine and have no distinct notion of a paragraph.
My point is: we do use region instead of block in a few places, such as some executables ocrd-*-region. Should we decide on a common parameter for level, that would be the moment to make sure we're consistent.
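As a rough sketch of what a common level parameter could look like, the snippet below normalizes the various spellings mentioned in this thread to one canonical set. Everything here is an assumption made for illustration: the `Level` enum, the alias table and `parse_level` are hypothetical names, and the set of levels is a guess rather than anything the OCR-D spec prescribes.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical canonical level names (an assumption, not the OCR-D spec)."""
    PAGE = "page"
    REGION = "region"
    PARAGRAPH = "paragraph"
    LINE = "line"
    WORD = "word"
    GLYPH = "glyph"


# Spellings used elsewhere that should map onto the canonical names.
ALIASES = {
    "block": Level.REGION,       # term used by the METS USE classes
    "textregion": Level.REGION,  # PAGE-XML element name
    "textline": Level.LINE,      # PAGE-XML element name
}


def parse_level(value: str) -> Level:
    """Map a user-supplied level string to a Level, accepting known aliases."""
    key = value.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    return Level(key)  # raises ValueError for unknown names


print(parse_level("BLOCK"))
print(parse_level("TextLine"))
```

Accepting aliases at the parameter boundary would let existing executables keep their names while the functional model settles on a single term.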