OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Floating regions #155

Closed beckstefan closed 3 years ago

beckstefan commented 3 years ago

When using the (recommended) tesserocr-segment-region, we get quite a lot of floating regions in normal text, which is of course undesirable. They appear rather spontaneously. Are there any settings to minimize their occurrence?

[Screenshots: floating_1, floating_2, floating_3]

bertsky commented 3 years ago

If you refer to the ImageRegion segments appearing mostly around the borders (marked in green by PageViewer): you should try cropping first (ocrd-anybaseocr-crop or ocrd-tesserocr-crop). And on the third page, this happens in a densely set text line with heavy show-through. Probably the best you can do there is to try a better binarization that reduces show-through. (If you are afraid of losing glyph details needed for recognition, consider binarizing twice: once with a higher threshold for segmentation, and then again more noisily for recognition.)
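To illustrate the "binarize twice" idea, here is a toy sketch in plain Python. It assumes a simple global intensity cutoff, which is not what the OCR-D binarization processors actually do (they use adaptive methods with their own parameters); the function name, the pixel values and both cutoffs are made up for illustration. "Strict" here means fewer pixels are kept as ink.

```python
def binarize(gray, cutoff):
    """Return a 0/1 image: 1 (ink) where the pixel is darker than cutoff."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]

# Toy 8-bit grayscale row (hypothetical values):
# real ink (30), show-through (150), faint stroke (110), paper (230)
GRAY = [[30, 150, 110, 230]]

# Strict cutoff: drops show-through (and the faint stroke), cleaner for segmentation.
for_segmentation = binarize(GRAY, cutoff=100)

# Lenient cutoff: keeps the faint stroke for recognition, at the price of more noise.
for_recognition = binarize(GRAY, cutoff=160)
```

In a real workflow the two results would be kept as separate derived images, one feeding segmentation and the other feeding recognition.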

beckstefan commented 3 years ago

No, I do not refer to the green ImageRegion but to the violet floating region. But yes, the ImageRegion is annoying too, and I'll try what you propose.

Btw the images are cropped, but that's again a different issue.

bertsky commented 3 years ago

> No, I do not refer to the green ImageRegion but to the violet floating region.

Oh I see, pardon me.

Well, this was my first choice when mapping Tesseract's internal PolyBlock types to PAGE's TextRegion/@type. AFAICT floating is the best translation for PULLOUT_TEXT. (And the hOCR translation also renders it as ocr_textfloat.)
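As a rough picture of such a mapping, here is a minimal sketch using a plain dictionary keyed by Tesseract's PolyBlockType names (without the PT_ prefix). Only the PULLOUT_TEXT to floating row comes from this thread; the other rows, the dictionary name and the helper function are assumptions for illustration and not necessarily what ocrd_tesserocr does.

```python
# Tesseract PolyBlockType names mapped to PAGE TextRegion/@type values.
# Only PULLOUT_TEXT -> "floating" is stated in this thread;
# every other row is an assumption made for illustration.
TEXT_TYPE_BY_POLYBLOCK = {
    "FLOWING_TEXT": "paragraph",  # assumption
    "HEADING_TEXT": "heading",    # assumption
    "PULLOUT_TEXT": "floating",   # as described in this thread
    "CAPTION_TEXT": "caption",    # assumption
}

def page_text_type(polyblock_name, default="other"):
    """Look up the PAGE text type for a PolyBlockType name (hypothetical helper)."""
    return TEXT_TYPE_BY_POLYBLOCK.get(polyblock_name, default)
```

With this table, `page_text_type("PULLOUT_TEXT")` gives `"floating"`, and any block type that is not listed falls back to the default.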

Tesseract's documentation says about this kind of ColumnSpanningType:

    // It is a pullout, as left and right were not in the same column, but
    // it doesn't go to the edge of its start and end.
    return CST_PULLOUT;

Maybe you'd like to post-process the output of the Tesseract segmentation?
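One possible post-processing step, sketched with Python's standard library only: rewrite every TextRegion whose @type is floating to some other type. The function name, the choice of paragraph as replacement and the sample document are assumptions for illustration; whether retyping is appropriate depends on your material.

```python
import xml.etree.ElementTree as ET

def retype_floating(page_xml, new_type="paragraph"):
    """Return (new_xml, count) with each TextRegion/@type 'floating' replaced."""
    root = ET.fromstring(page_xml)
    # Take the namespace from the root tag so that any PAGE schema version works.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    region_tag = "{%s}TextRegion" % ns if ns else "TextRegion"
    if ns:
        # Serialize with a default namespace instead of an ns0: prefix.
        # Note that this registration is global to ElementTree.
        ET.register_namespace("", ns)
    changed = 0
    for region in root.iter(region_tag):
        if region.get("type") == "floating":
            region.set("type", new_type)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed

# Minimal stand-in for a PAGE file (hypothetical content, coordinates omitted).
SAMPLE = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<Page imageFilename="x.png" imageWidth="100" imageHeight="100">'
    '<TextRegion id="r1" type="paragraph"/>'
    '<TextRegion id="r2" type="floating"/>'
    '</Page></PcGts>'
)

new_xml, changed = retype_floating(SAMPLE)
```

The function works on the XML string rather than on a file, so it can be tried on a copy of the output before touching a workspace.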

> Btw the images are cropped, but that's again a different issue.

...cropped to the DFG viewer margin perhaps? (Your screenshots don't contain the Border annotation.)

Maybe you can try to crop a second time... let me check: Nope, not possible unfortunately with both current croppers!

(But maybe we should allow cropping incrementally? We could have the user decide whether to repeat or compose cropping with an extra parameter overwrite_border...)

beckstefan commented 3 years ago

Thank you. That does explain it!

Maybe I'll try post-correction later.

> ...cropped to the DFG viewer margin perhaps? (Your screenshots don't contain the Border annotation.)

There's no border to see because LAREX doesn't show it. Page Viewer does show it and it's very wide.

bertsky commented 3 years ago

@beckstefan can we close this? Or do you want better documentation of the TextRegion/@type cases (or pointers to PAGE docs)?

We could even take this further by exploiting all of Tesseract's internal attributes regarding text alignment...

beckstefan commented 3 years ago

I think documentation of which regions the tools produce, plus some hints on post-processing to get rid of the "unwanted" ones, could help a lot.

From my side, please feel free to close the issue.

bertsky commented 3 years ago

> I think documentation of which regions the tools produce, plus some hints on post-processing to get rid of the "unwanted" ones, could help a lot.

Wondering what the right place for that kind of description would be. It seems natural to say something about the region types detected directly in the docstring (which is forwarded by --help). But we will have 3 CLI frontends for that (ocrd-tesserocr-segment-region, ocrd-tesserocr-segment and ocrd-tesserocr-recognize).

Also, describing how to do post-processing (like polygonalising text lines and clipping/shrinking text regions) seems useful, but would have to be included in many places redundantly. And explaining how TextRegion/@type can be mapped programmatically is an altogether different kind of task (not user but developer documentation).
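Independently of where such documentation ends up, a quick way to see which region kinds a given run produced is to tally them from the PAGE output. This is a minimal sketch with the standard library; the function name and the sample document are assumptions for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def tally_regions(page_xml):
    """Count region elements (plus Border) and TextRegion/@type values."""
    elements, text_types = Counter(), Counter()
    for el in ET.fromstring(page_xml).iter():
        name = el.tag.rsplit("}", 1)[-1]  # strip the namespace, if any
        if name.endswith("Region") or name == "Border":
            elements[name] += 1
        if name == "TextRegion":
            text_types[el.get("type", "(untyped)")] += 1
    return elements, text_types

# Minimal stand-in for a PAGE file (hypothetical content, coordinates omitted).
PAGE_SAMPLE = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<Page imageFilename="x.png" imageWidth="100" imageHeight="100">'
    '<TextRegion id="r1" type="paragraph"/>'
    '<TextRegion id="r2" type="floating"/>'
    '<ImageRegion id="r3"/>'
    '</Page></PcGts>'
)

elements, text_types = tally_regions(PAGE_SAMPLE)
```

A missing Border count, as in this sample, shows that the page carries no Border annotation at all, which is easy to overlook in viewers that do not display it.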

Maybe the README.md would be better suited for all this. @kba, what do you think?

kba commented 3 years ago

> Maybe the README.md would be better suited for all this. @kba, what do you think?

+1, for non-processor-specific documentation, README is the best place.