recognize: use PSM_RAW_LINE instead of PSM_SINGLE_LINE

bertsky commented 4 years ago

Our OCR-D wrappers have the advantage of letting us isolate subtasks in a fine-grained manner. Tesseract's CLI, on the other hand, must always provide a good all-in-one compromise, even when called with a specific PSM. (E.g. it will always binarize where that is still necessary for layout analysis, and attempt baseline + x-height + ascender prediction even in PSM_SINGLE_LINE.)

Now, for some workflows it might be beneficial to suppress any additional Tesseract-internal segmentation on the provided line images, for instance when that image has already been cropped and masked from a line polygon, or has already been clipped. Under these circumstances, we should rather use PSM_RAW_LINE.

But other workflows will just enter with the same rough bounding boxes that Tesseract's CLI would also create internally. Then PSM_SINGLE_LINE is a better choice.

So how do we encapsulate this without confusing users, while still giving them the best possible results? Do we check the line segment's number of points (for polygon vs bbox workflow) and decide automatically, or expose this as a (thoroughly described) parameter?
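A minimal sketch of what the automatic variant could look like, assuming the line outline is available as a list of `(x, y)` points. The function names are hypothetical and not part of ocrd_tesserocr. Note that the point count alone says nothing about whether the image was already masked or clipped by an earlier step, which is one argument for an explicit parameter instead.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]


def is_axis_aligned_box(points: Iterable[Point]) -> bool:
    """Return True if the outline is a plain rectangle parallel to the axes."""
    corners = set(map(tuple, points))
    if len(corners) != 4:
        return False
    xs = {x for x, _ in corners}
    ys = {y for _, y in corners}
    # Four distinct corners drawn from two x values and two y values
    # can only be the four corners of an axis-aligned rectangle.
    return len(xs) == 2 and len(ys) == 2


def suggest_raw_line(points: Iterable[Point]) -> bool:
    """Heuristic (an assumption): a true polygon hints at a polygon workflow."""
    return not is_axis_aligned_box(points)
```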

@wrznr @kba @stweil

kba commented 4 years ago

> expose this as a (thoroughly described) parameter?

☝️ This, with the first three paragraphs of your issue as the documentation.

bertsky commented 4 years ago

Seriously: how long may the description be? Are multiple lines allowed?

bertsky commented 4 years ago

I'm afraid PSM_RAW_LINE is still buggy in Tesseract: It sometimes crashes in Tesseract::recog_all_words because there is an attempt to DeleteCurrentWord() when nothing is there to delete. It appears to depend on the model, too: I have not seen this happen with models from tessdata/_best, but it happens all the time with models from tesstrain.

I will investigate (and hopefully fix) in Tesseract...

stweil commented 4 years ago

That's typically caused by models which have different unicharsets for legacy and LSTM recognizer when Tesseract uses the wrong unicharset.

I fixed some of those cases in the past.

Is there already an issue for the crash?

bertsky commented 4 years ago

> That's typically caused by models which have different unicharsets for legacy and LSTM recognizer when Tesseract uses the wrong unicharset.

I see. Thanks for the hint!

bertsky commented 4 years ago

Yes, we're on the right track here: This only happens when there are more than 2 models loaded which exhibit the "Failed to load any lstm-specific dictionaries for lang" warning (like GT4HistOCR). A single model is always fine, as are multiple LSTM models with LSTM dictionaries. Plus, I only get this on garbage input like:

[Image: OCR-D-IMG-DEWARP_0003_TextRegion_1475753831444_264_line_1475753831522_266]

Nevertheless, if multiple loaded models compete in their access to the original segmentation (with one stealing from the others and leaving them in an inconsistent state), then that's a problem to be fixed, regardless of where or how it surfaces.

bertsky commented 4 years ago

I just observed another grave problem with Tesseract's PSM_RAW_LINE: it cannot cope with large horizontal white-space! For example, consider this line:

[Image: OCR-D-IMG-LINES_0026_region0058_0000_region0058_0000_line0000 bin]

Compare:

| mode | text result | LSTM decoder |
| --- | --- | --- |
| `PSM_RAW_LINE` | `Zahlungsempfänger: 0000 KON(IO:` | called on complete line |
| `PSM_SINGLE_LINE` | `Zahlungsempfänger: Konto:` | called on both words separately |

Thus, obviously, the LSTM decoder itself is unable to output larger white-space, and in the process of trying to deliver something for nothing, it enters very bad (unlikely) states which even degrade the recognition of the material that follows.
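As a rough illustration of how a caller could detect such lines before choosing raw-line mode, here is a sketch that measures the widest run of empty columns between inked columns. It assumes a binarized line image given as rows of 0/1 values (1 = foreground); the function names and the threshold relative to line height are assumptions for illustration, not anything Tesseract or ocrd_tesserocr provides.

```python
from typing import Sequence


def widest_inner_gap(image: Sequence[Sequence[int]]) -> int:
    """Width of the longest run of empty columns between inked columns."""
    if not image or not image[0]:
        return 0
    width = len(image[0])
    # A column counts as inked if any row has foreground in it.
    inked = [any(row[x] for row in image) for x in range(width)]
    if True not in inked:
        return 0
    first = inked.index(True)
    last = width - 1 - inked[::-1].index(True)
    widest = run = 0
    for x in range(first, last + 1):
        if inked[x]:
            run = 0
        else:
            run += 1
            widest = max(widest, run)
    return widest


def has_large_gap(image: Sequence[Sequence[int]], factor: float = 2.0) -> bool:
    """Assumed rule of thumb: a gap wider than factor * line height is 'large'."""
    return widest_inner_gap(image) > factor * len(image)
```

A line flagged this way could be sent through PSM_SINGLE_LINE instead, where Tesseract splits it into words before decoding.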

I am inclined to say: case closed, this is unusable. (And I shall remove it from #104 as well.)

But someone might still object that if you really know what you are doing (i.e. that you don't have any intruding components from neighbours or large white-space), then there may be a use-case for you.

@wrznr @kba @stweil

wrznr commented 4 years ago

Two things:

1. I have the option to choose from both PSM modes via a CLI parameter, right? If so, pls. leave PSM_RAW_LINE as the default. For me, it delivers better results in most cases.

2. I tend to say that the example you posted above is not a single line. It looks more like a table or tabbing where the layout/line recognition failed. We should not try to adjust our processors to the defective behavior of other processors.

bertsky commented 4 years ago

> I have the option to choose from both PSM modes via a CLI parameter, right?

Yes, but PSM 13 (raw line) was only added late, it's not well documented, and, with the 2 issues described here, looks a lot like work in progress.

> If so, pls. leave PSM_RAW_LINE as the default. For me, it delivers better results in most cases.

Then your workflow is sophisticated enough. (But I would be interested to know how you can prevent the large white-space issue.) But for the average user, using this module alone for line segmentation, this will likely deliver bad results. Simple workflows will lose far more than sophisticated workflows can gain. I would argue for the principle of least astonishment here: you can still set raw_lines=true if you have invested in an elaborate workflow and thus already know what to expect.
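In code, the opt-in default argued for here could be as small as the following sketch. The numeric values are those of Tesseract's `PageSegMode` enum (7 for single line, 13 for raw line); the function name is hypothetical, and `raw_lines` is the parameter name used in this thread.

```python
# Values from Tesseract's PageSegMode enum.
PSM_SINGLE_LINE = 7
PSM_RAW_LINE = 13


def line_psm(raw_lines: bool = False) -> int:
    """Pick the page segmentation mode for line-level recognition.

    Defaults to PSM_SINGLE_LINE; raw-line mode must be requested explicitly.
    """
    return PSM_RAW_LINE if raw_lines else PSM_SINGLE_LINE
```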

> I tend to say that the example you posted above is not a single line. It looks more like a table or tabbing where the layout/line recognition failed. We should not try to adjust our processors to the defective behavior of other processors.

That may be true, but again, it's what this module (i.e. Tesseract itself) delivers as line segmentation. And this goes for Ocropy as well (or even more, since it has no column detection and thus needs to fill hspace to prevent misordering). Also, we can expect to see this in less extreme cases, too. It's the LSTM decoder's general mode of operation which is the culprit here.

wrznr commented 4 years ago

Okay! Changing the default might be a good idea then.

OCR-D / ocrd_tesserocr

recognize: use PSM_RAW_LINE instead of PSM_SINGLE_LINE #101