Closed bertsky closed 4 years ago
expose this as a (thoroughly described) parameter?
☝️ This, with the first three paragraphs of your issue as the documentation.
Seriously: how long may description be? Is multi-line allowed?
I'm afraid PSM_RAW_LINE is still buggy in Tesseract: It sometimes crashes in Tesseract::recog_all_words because there is an attempt to DeleteCurrentWord() when nothing is there to delete. It appears to depend on the model, too: I have not seen this happen with models from tessdata/_best, but it happens all the time with models from tesstrain.
I will investigate (and hopefully fix) in Tesseract...
That's typically caused by models which have different unicharsets for legacy and LSTM recognizer when Tesseract uses the wrong unicharset.
I fixed some of those cases in the past.
Is there already an issue for the crash?
That's typically caused by models which have different unicharsets for legacy and LSTM recognizer when Tesseract uses the wrong unicharset.
I see. Thanks for the hint!
Yes, we're on the right track here: This only happens when there are more than 2 models loaded which exhibit the "Failed to load any lstm-specific dictionaries for lang" warning (like GT4HistOCR). Only 1 model is always fine, as are multiple LSTM models with LSTM dictionaries. Plus I only get this on garbage input like:
Nevertheless, if multiple loaded models compete in their access to the original segmentation (with one stealing from the others and leaving them in an inconsistent state), then that's a problem to be fixed, regardless of where or how it surfaces.
I just observed another grave problem with Tesseract's PSM_RAW_LINE: it cannot cope with large horizontal white-space! For example, consider this line
Compare:
| mode | text result | LSTM decoder |
|---|---|---|
| PSM_RAW_LINE | Zahlungsempfänger: 0000 KON(IO: | called on complete line |
| PSM_SINGLE_LINE | Zahlungsempfänger: Konto: | called on both words separately |
Thus, obviously, the LSTM decoder itself is unable to output larger white-space, and in the process of trying to deliver something for nothing, enters very bad (unlikely) states which are even detrimental to follow-up material.
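For anyone who wants to check whether a line image is at risk before handing it to raw-line mode, here is a minimal sketch of a column-projection test for wide interior gaps. It is not part of Tesseract or of this module: the function name `find_wide_gaps`, the list-of-rows image representation and the `min_gap` threshold are all assumptions made for illustration, and a sensible threshold would have to be tuned per material.

```python
from typing import List, Tuple


def find_wide_gaps(image: List[List[int]], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of interior runs of
    empty columns that are at least min_gap pixels wide.

    image is a hypothetical row-major binary raster: 1 = ink, 0 = background.
    """
    if not image:
        return []
    width = len(image[0])
    # A column counts as empty when no row has ink in it.
    empty = [all(row[x] == 0 for row in image) for x in range(width)]
    gaps = []
    start = None
    for x, is_empty in enumerate(empty):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            # A run starting at column 0 is the left margin, not an interior gap.
            if start > 0 and x - start >= min_gap:
                gaps.append((start, x))
            start = None
    # A run still open at the end is the right margin, so it is ignored.
    return gaps
```

If this returns a non-empty list for a line, that line probably resembles the example above, and falling back to PSM_SINGLE_LINE for it may be the safer choice.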
I am inclined to say: case closed, this is unusable. (And I shall remove it from #104 as well.)
But someone might still object that if you really know what you are doing (i.e. that you don't have any intruding components from neighbours or large white-space), then there may be a use-case for you.
@wrznr @kba @stweil
Two things:
1. I have the option to choose from both PSM modes via a CLI parameter, right? If so, pls. leave PSM_RAW_LINE as the default. For me, it delivers better results in most cases.
2. I tend to say that the example you posted above is not a single line. It looks more like a table or tabbing where the layout/line recognition failed. We should not try to adjust our processors to the defective behavior of other processors.
- I have the option to choose from both PSM modes via a CLI parameter, right?
Yes, but PSM 13 (raw line) was only added late, it's not well documented, and, with the 2 issues described here, looks a lot like work in progress.
If so, pls. leave PSM_RAW_LINE as the default. For me, it delivers better results in most cases.
Then your workflow is sophisticated enough. (But I would be interested to know how you can prevent the large white-space issue.) But for the average user, using this module alone for line segmentation, this will likely deliver bad results. Simple workflows will lose far more than sophisticated workflows can gain. I would argue for the principle of least astonishment here: That you can still set raw_lines=true if you have invested in an elaborate workflow and thus already know what to expect.
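To illustrate the proposed default, a minimal sketch of how a boolean raw_lines parameter could be resolved to a page segmentation mode. The numeric values 7 and 13 are Tesseract's own PSM numbers; the function name `select_psm` and the plain-dict parameter handling are assumptions for illustration, not the module's actual code.

```python
# Tesseract page segmentation mode numbers (as accepted by --psm).
PSM_SINGLE_LINE = 7   # treat the image as a single text line
PSM_RAW_LINE = 13     # raw line: skip Tesseract-internal line handling


def select_psm(parameter: dict) -> int:
    """Pick the PSM from a (hypothetical) processor parameter dict.

    raw_lines defaults to False, so users who do nothing get
    PSM_SINGLE_LINE and only an explicit opt-in yields PSM_RAW_LINE.
    """
    return PSM_RAW_LINE if parameter.get("raw_lines", False) else PSM_SINGLE_LINE
```

With this, `select_psm({})` gives single-line mode, and only `select_psm({"raw_lines": True})` switches to raw-line mode.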
- I tend to say that the example you posted above is not a single line. It looks more like a table or tabbing where the layout/line recognition failed. We should not try to adjust our processors to the defective behavior of other processors.
That may be true, but again, it's what this module (i.e. Tesseract itself) delivers as line segmentation. And this goes for Ocropy as well (or even more, since it has no column detection and thus needs to fill hspace to prevent misordering). Also, we can expect to see this in less extreme cases, too. It's the LSTM decoder's general mode of operation which is the culprit here.
Okay! Changing the default might be a good idea then.
Our OCR-D wrappers have the advantage of allowing us to isolate subtasks in a finely grained manner. Tesseract's CLI on the other hand must always provide a good all-in-one compromise, even when called with different PSMs specifically. (E.g. it will always binarize when still necessary for layout analysis, and attempt baseline+xheight+ascender prediction even in PSM_SINGLE_LINE.)
Now, for some workflows it might be beneficial to suppress any additional Tesseract-internal segmentation on the provided line images – for instance when that image is cropped and masked from a line polygon already, or clipped already. Under these circumstances, we should rather use PSM_RAW_LINE.
But other workflows will just enter with the same rough bounding boxes that Tesseract's CLI would also create internally. Then PSM_SINGLE_LINE is a better choice.
So how do we encapsulate this without confusing users, but giving them the best possible results? Do we check the line segment's number of points (for polygon vs bbox workflow), and decide automatically, or expose this as a (thoroughly described) parameter?
@wrznr @kba @stweil
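For the "decide automatically" option, a rough sketch of what the polygon vs bbox check could look like. It assumes the line outline is available as a list of (x, y) tuples; the names `is_bounding_box` and `guess_raw_line` are hypothetical, and the heuristic says nothing about whether the line contains large white-space.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def is_bounding_box(points: List[Point]) -> bool:
    """True if the outline is exactly the 4 corners of an axis-aligned rectangle."""
    if len(points) != 4 or len(set(points)) != 4:
        return False
    xs = {x for x, _ in points}
    ys = {y for _, y in points}
    # Four distinct points drawn from two x values and two y values
    # can only be the four rectangle corners.
    return len(xs) == 2 and len(ys) == 2


def guess_raw_line(points: List[Point]) -> bool:
    """Heuristic: a true polygon suggests an elaborate workflow, so raw-line
    mode may be appropriate; a plain rectangle suggests single-line mode."""
    return not is_bounding_box(points)
```

A rectangle such as `[(0, 0), (100, 0), (100, 20), (0, 20)]` would select single-line mode, while any outline with more points, or with 4 points that are not axis-aligned, would select raw-line mode.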