`tesserocr-recognize` acts differently when segmenting alone versus segmenting+recognising

sven-nm commented 3 years ago

It looks like tesserocr-recognize yields significantly different results when used (1) for segmentation and recognition in a single step and (2) for segmentation in one step and recognition in another.

For instance, the two commands below do not yield the same results (the second being much more prone to errors):

```sh
# Segmentation and recognition in one step
tesserocr-recognize -I OCR-D-BIN -O OCR-D-SEG -P segmentation_level region -P textequiv_level word -P model eng+grc

# Segmentation step
tesserocr-recognize -I OCR-D-BIN -O OCR-D-REG -P segmentation_level region -P textequiv_level word
# Recognition step
tesserocr-recognize -I OCR-D-REG -O OCR-D-OCR -P segmentation_level none -P model eng+grc
```

Even though the docs mention here and here that setting segmentation_level to none prevents tesserocr-recognize from resegmenting, @bertsky suggested in this thread that tesserocr-recognize would resegment words anyway. To avoid confusion, this should perhaps be mentioned explicitly in the docs.

EEngl52 commented 3 years ago

https://github.com/OCR-D/ocrd-website/pull/216

bertsky commented 3 years ago

> Even though the docs mention here and here that setting segmentation_level to none prevents tesserocr-recognize from resegmenting, @bertsky suggested in this thread that tesserocr-recognize would resegment words anyway. To avoid confusion, this should perhaps be mentioned explicitly in the docs.

I'm afraid there was a misunderstanding here. Sorry if I'm late to the discussion and even more so if my initial answer in the chat was confusing.

Anyway, here's the point: the all-in-one workflow differs from the segment-then-recognize workflow not because the latter's recognition step resegments anything (above the glyph level), but precisely because it does not. In other words, because that recognition step is tied to the word coordinates from the rule-based (Omnifont) segmenter, it cannot reach the same accuracy as the all-in-one workflow, which, in contrast, is allowed to override the word segmentation with the LSTM results.

You should get more similar results between the all-in-one and modularized workflows if you pick the line level as the point of delivery. Even then, the results won't be exactly the same, because:

- all-in-one: Tesseract's internal data structures will pass non-overlapping (polygonal) lines to the LSTMs
- modular: the ResultIterator will only give bounding boxes, which may still overlap (but you can try to use -P shrink_polygons true against that)
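To see why bounding boxes can overlap even when the underlying line polygons do not, here is a toy sketch in plain Python. It is not Tesseract or OCR-D code: the coordinates, the skew value and the helper names (`skewed_line`, `bounding_box`, `boxes_overlap`) are all made up for illustration, assuming two slightly skewed text lines with a small gap between them.

```python
SLOPE = 0.2  # assumed skew: each line rises 0.2 px per px of width


def skewed_line(top, height, width=100):
    """Parallelogram for a skewed text line, as a list of (x, y) points."""
    rise = SLOPE * width
    return [(0, top), (width, top + rise),
            (width, top + rise + height), (0, top + height)]


def bounding_box(polygon):
    """Axis-aligned box (x0, y0, x1, y1) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) rectangles share interior area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


# Line 1 spans y in [0.2x, 0.2x + 20]; line 2 spans y in [0.2x + 22, 0.2x + 42].
# At every x there is a 2 px gap, so the two polygons never touch.
line_1 = skewed_line(top=0, height=20)
line_2 = skewed_line(top=22, height=20)

box_1 = bounding_box(line_1)
box_2 = bounding_box(line_2)
overlap = boxes_overlap(box_1, box_2)

print(box_1, box_2, overlap)
```

In this made-up case the boxes share the band between y = 22 and y = 40, so a crop of either box would contain part of the neighbouring line, whereas the polygons would not.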

OCR-D / ocrd-website

`tesserocr-recognize` acts differently when segmenting alone versus segmenting+recognising #215