OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

Use "block" instead of "region" throughout #135

Closed kba closed 4 years ago

kba commented 4 years ago

ALTO tries to be interoperable with IIIF as discussed here. There is a "Text Granularity Extension" for IIIF that defines what we call "levels":

page: A page in a paginated document
block: An arbitrary region of text
paragraph: A paragraph
line: A topographic line
word: A single word
glyph: A single glyph or symbol

Seems reasonably compatible with our definitions, though we call line TextLine and have no distinct notion of a paragraph.
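
To make the comparison concrete, here is a minimal sketch of the six levels as an ordered enumeration, with an assumed correspondence to PAGE-XML element names. The names `Level`, `PAGE_XML_ELEMENT` and `is_finer` are hypothetical and not part of any OCR-D API; the mapping of block to `TextRegion` only covers text blocks.

```python
from enum import Enum


class Level(Enum):
    """IIIF text granularity levels, listed from coarsest to finest."""
    PAGE = "page"
    BLOCK = "block"
    PARAGRAPH = "paragraph"
    LINE = "line"
    WORD = "word"
    GLYPH = "glyph"


# Assumed correspondence to PAGE-XML element names (illustrative only).
# None marks the level for which we have no distinct notion so far.
PAGE_XML_ELEMENT = {
    Level.PAGE: "Page",
    Level.BLOCK: "TextRegion",
    Level.PARAGRAPH: None,
    Level.LINE: "TextLine",
    Level.WORD: "Word",
    Level.GLYPH: "Glyph",
}


def is_finer(a: Level, b: Level) -> bool:
    """Return True if level `a` is strictly finer-grained than level `b`."""
    order = list(Level)  # Enum iteration preserves definition order
    return order.index(a) > order.index(b)
```

For instance, `is_finer(Level.WORD, Level.LINE)` holds, while `PAGE_XML_ELEMENT[Level.PARAGRAPH]` is `None`, which is exactly the gap mentioned above.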

My point is: We do use region instead of block in a few places, such as some executables ocrd-*-region. Should we decide on a common parameter for level, that would be a good moment to make sure we're consistent.

bertsky commented 4 years ago

I concur, we should use the next opportunity to use the term block instead of region more consistently. (Our METS file group USE classes already use BLOCK, but we are already discussing relaxing that scheme.)

Incidentally, this hierarchy is also identical with Tesseract's RIL (ResultIterator levels).
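
As a rough illustration of that correspondence, the lookup below pairs the level names from the list above with the names of the `RIL_*` constants in Tesseract's `PageIteratorLevel` enum. It uses plain strings rather than a Tesseract binding, and `RIL_BY_LEVEL` and `ril_for` are hypothetical names. Note that the RIL hierarchy starts at block, so page has no entry here.

```python
# Level name (as in the IIIF list) -> name of the Tesseract RIL constant.
RIL_BY_LEVEL = {
    "block": "RIL_BLOCK",
    "paragraph": "RIL_PARA",
    "line": "RIL_TEXTLINE",
    "word": "RIL_WORD",
    "glyph": "RIL_SYMBOL",
}


def ril_for(level: str) -> str:
    """Look up the RIL constant name for a level, rejecting unknown levels."""
    try:
        return RIL_BY_LEVEL[level]
    except KeyError:
        raise ValueError("no RIL counterpart for level %r" % level)
```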

But I don't think we are entirely incompatible with a paragraph level either. PAGE-XML defines all region types fully recursively, and designates @type="paragraph" for paragraphs. IIUC the current OCR-D spec and implementation is agnostic about whether regions should be used in a flat or multi-level fashion. In some places though, PAGE-XML already requires at least 2 levels (namely table cells and footnotes, perhaps also text blocks comprising a @type="drop-capital" and a @type="paragraph" region). These are also the very places we have not tackled at all with the current toolset yet. Our GT however mostly already uses 2 levels for that – and rightly so, because this is most versatile (it can still be reduced to a flat regime, but can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues).
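
The claim that a 2-level representation can still be reduced to a flat regime can be sketched as follows. This is a toy model, not PAGE-XML parsing: `Region` and `flatten` are hypothetical names, and keeping only the innermost regions is one assumed way of flattening (keeping only the outer blocks would be another).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """Toy stand-in for a PAGE-XML region that may contain sub-regions."""
    id: str
    type: str = ""
    children: List["Region"] = field(default_factory=list)


def flatten(regions: List[Region]) -> List[Region]:
    """Reduce nested regions to a flat list of innermost regions, in order."""
    flat = []
    for region in regions:
        if region.children:
            # Descend into the block and keep its sub-regions instead.
            flat.extend(flatten(region.children))
        else:
            flat.append(region)
    return flat


# A text block holding a drop capital and a paragraph, plus a flat footnote.
block = Region("r1", children=[
    Region("r1_1", "drop-capital"),
    Region("r1_2", "paragraph"),
])
page = [block, Region("r2", "footnote")]
flat_ids = [r.id for r in flatten(page)]
```

The reverse direction is not possible without extra information, which is why the nested form is the more versatile starting point.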

So I would propose also taking that opportunity to decide in favour of a mildly recursive region representation of 2 levels, with both a block level and an explicit paragraph level in the functional model. This would allow e.g. ocrd-tesserocr-segment to operate on 3 distinct output levels:

  1. block segmentation from page to blocks (of any type),
  2. paragraph segmentation from text blocks to paragraphs and from table blocks to table cells (as a prerequisite for further representation),
  3. line segmentation from paragraphs to text lines.

bertsky commented 4 years ago

Okay, so there was consensus in the VC that:

cneud commented 4 years ago

With b199c62b314b9b4b8279811029bf7fc2c47b87da, can we release this next week?

kba commented 4 years ago

Yes, and also merge https://github.com/OCR-D/assets/pull/73

kba commented 4 years ago

Released and assets adapted.