Closed sven-nm closed 3 years ago
I'm afraid there was a misunderstanding here. Sorry if I'm late to the discussion and even more so if my initial answer in the chat was confusing.
Anyway, here's the point: the reason the all-in-one workflow does not yield the same results as this segment-then-recognize workflow is not that the latter's recognition step resegments anything (above the glyph level), but rather that it does not. In other words, because it is tied to the word coordinates from the rule-based (Omnifont) segmenter, it cannot reach the same accuracy as the all-in-one workflow, which in contrast is allowed to override the word segmentation via LSTM results.
You should get more similar results between the all-in-one and modularized workflows if you pick the line level as the point of delivery. Even then, the results won't be exactly the same, because:
-P shrink_polygons true
against that)
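
To make the contrast concrete, here is a rough sketch of how the three variants discussed above might be invoked. These are not the commands from the original report: the file group names and the `MODEL` default are placeholders, and the line-level variant is only one reading of "line level as point of delivery". Only the processor names and the `segmentation_level`, `textequiv_level` and `model` parameters are meant literally.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: file group names (OCR-D-*) and MODEL are
# placeholders, not taken from the original report.
MODEL="${MODEL:-Fraktur}"

# All-in-one: recognition also segments, from regions down to words,
# so LSTM results may override the word boundaries.
all_in_one() {
    ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR-ALLINONE \
        -P segmentation_level region -P textequiv_level word -P model "$MODEL"
}

# Shared segmentation steps (regions, then lines).
segment_to_lines() {
    ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-REGION &&
    ocrd-tesserocr-segment-line -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE
}

# Modular, word level: recognition is tied to the word coordinates
# produced by the rule-based word segmenter.
modular_word_level() {
    segment_to_lines &&
    ocrd-tesserocr-segment-word -I OCR-D-SEG-LINE -O OCR-D-SEG-WORD &&
    ocrd-tesserocr-recognize -I OCR-D-SEG-WORD -O OCR-D-OCR-WORD \
        -P segmentation_level none -P textequiv_level word -P model "$MODEL"
}

# Modular, line level: hand over existing lines and recognize per line.
modular_line_level() {
    segment_to_lines &&
    ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-LINE \
        -P segmentation_level none -P textequiv_level line -P model "$MODEL"
}
```

The functions are only defined here, not called. To try one, source the script inside an OCR-D workspace directory (one containing `mets.xml`) and call, for example, `modular_line_level`, then compare its output file group with that of `all_in_one`.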
It looks like `tesserocr-recognize` yields significantly different results when used (1) for segmentation and recognition in a single step and (2) for segmentation in one step and recognition in another. For instance, the two commands below do not yield the same results (the second being much more prone to errors):
Even though the docs mention here and here that setting `segmentation_level` to `none` prevents `tesserocr-recognize` from resegmenting, @bertsky suggested in this thread that `tesserocr-recognize` would resegment words anyway. To avoid confusion, this should perhaps be mentioned explicitly in the docs.