Closed beckstefan closed 3 years ago
If you refer to the ImageRegion segments appearing mostly around the borders (marked in green by PageViewer): you should try cropping first (ocrd-anybaseocr-crop or ocrd-tesserocr-crop). And on the third page, this happens in a densely set text line with heavy show-through. Probably the best you can do here is to try a better binarization that reduces show-through. (If you are afraid of losing glyph details needed for recognition, consider binarizing twice: once with a higher threshold for segmentation, and then again more noisily for recognition.)
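To illustrate the two-pass idea, here is a minimal sketch using a toy global threshold in plain Python. It is not how the OCR-D binarizers work internally (those are typically adaptive), and the names `binarize` and `cutoff` as well as the gray values are made up for the example. Note that whether "higher threshold" means a larger or a smaller number depends on the parameter convention of the binarizer you use; here a smaller cutoff is the stricter one.

```python
def binarize(gray, cutoff):
    """Return a binary image: 1 (ink) where the gray value is below cutoff, else 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


# Toy 8-bit grayscale row (assumed values):
# solid ink, thin stroke, show-through from the verso, blank paper
GRAY = [[30, 110, 150, 230]]

# Strict pass: only solid ink survives, so show-through cannot glue regions together.
for_segmentation = binarize(GRAY, cutoff=100)

# Lenient pass: thin strokes are kept for recognition, at the cost of some noise.
for_recognition = binarize(GRAY, cutoff=170)

print(for_segmentation)
print(for_recognition)
```

The segmentation result would then be computed on the strict image, while recognition reads the lenient one.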
No, I do not refer to the green ImageRegion but to the violet floating region. But yes, the ImageRegion is annoying, too, and I'll try what you propose.
Btw the images are cropped, but that's again a different issue.
No, I do not refer to the green ImageRegion but to the violet floating region.
Oh I see, pardon me.
Well, this was my first choice when mapping Tesseract's internal PolyBlock types to PAGE's TextRegion/@type. AFAICT floating is the best translation for PULLOUT_TEXT. (And the hOCR translation also renders it as ocr_textfloat.)
Tesseract's documentation says about this kind of ColumnSpanningType:
// It is a pullout, as left and right were not in the same column, but
// it doesn't go to the edge of its start and end.
return CST_PULLOUT;
Maybe you'd like to post-process the output of the Tesseract segmentation?
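As a rough idea of what such post-processing could look like, here is a sketch that rewrites TextRegion/@type from floating to paragraph directly in the PAGE-XML, using only the Python standard library. The function name `retype_regions`, the sample document, and the choice of paragraph as the replacement type are assumptions for illustration; the namespace is the PAGE 2019-07-15 one and would need adjusting for files written against another schema version.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

# PAGE namespace (2019-07-15 schema version); adjust if your files use another one
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def retype_regions(page_xml: str, old_type: str = "floating",
                   new_type: str = "paragraph") -> Tuple[str, int]:
    """Change TextRegion/@type from old_type to new_type; return (xml, count)."""
    ET.register_namespace("", PAGE_NS)  # keep the default namespace on output
    root = ET.fromstring(page_xml)
    changed = 0
    # iter() also visits nested TextRegions
    for region in root.iter("{%s}TextRegion" % PAGE_NS):
        if region.get("type") == old_type:
            region.set("type", new_type)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical minimal PAGE document for demonstration
SAMPLE = (
    '<PcGts xmlns="%s">'
    '<Page imageFilename="p.png" imageWidth="100" imageHeight="100">'
    '<TextRegion id="r1" type="floating"><Coords points="0,0 10,0 10,10 0,10"/></TextRegion>'
    '<TextRegion id="r2" type="heading"><Coords points="0,20 10,20 10,30 0,30"/></TextRegion>'
    '</Page></PcGts>' % PAGE_NS
)

new_xml, count = retype_regions(SAMPLE)
print(count)
```

Whether retyping is enough, or whether such regions should rather be merged into their neighbours, depends on what the downstream steps expect.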
Btw the images are cropped, but that's again a different issue.
...cropped to the DFG viewer margin perhaps? (Your screenshots don't contain the Border annotation.)
Maybe you can try to crop a second time... let me check: Nope, unfortunately that is not possible with either of the current croppers!
(But maybe we should allow cropping incrementally? We could have the user decide whether to repeat or compose cropping with an extra parameter overwrite_border...)
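To make the repeat-versus-compose distinction concrete, here is a purely hypothetical sketch. Nothing like this exists in the croppers at this point; `crop_again`, `compose_border` and the box convention (x0, y0, x1, y1) are invented for illustration, and only the parameter name overwrite_border comes from the proposal above.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def compose_border(existing: Box, inner: Box) -> Box:
    """Translate a box found on the already cropped image back into
    original-image coordinates, clamped to the existing border."""
    ex0, ey0, ex1, ey1 = existing
    ix0, iy0, ix1, iy1 = inner
    return (ex0 + max(ix0, 0), ey0 + max(iy0, 0),
            min(ex0 + ix1, ex1), min(ey0 + iy1, ey1))


def crop_again(existing: Optional[Box], detected: Box,
               overwrite_border: bool) -> Box:
    """Repeat cropping (replace the border) or compose with the existing one."""
    if existing is None or overwrite_border:
        # detected is taken as given, replacing any earlier border
        return detected
    # detected is relative to the image cropped by the existing border
    return compose_border(existing, detected)


first = (100, 50, 900, 1150)   # assumed border from a first crop
second = (20, 10, 700, 1000)   # assumed result of a second crop

print(crop_again(first, second, overwrite_border=False))
print(crop_again(first, second, overwrite_border=True))
```

In the compose case the second crop can only shrink the border, which is what one would want for removing a leftover viewer margin.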
Thank you. That does explain it!
Maybe I'll try post-correction later.
...cropped to the DFG viewer margin perhaps? (Your screenshots don't contain the Border annotation.)
There's no border to see because LAREX doesn't show it. Page Viewer does show it and it's very wide.
@beckstefan can we close this? Or do you want better documentation of the TextRegion/@type cases (or pointers to PAGE docs)?
We could even take this further by exploiting all of Tesseract's internal attributes regarding text alignment...
I think documentation of which regions the tools produce, plus some hints for post-processing to get rid of "unwanted" ones, could help a lot.
From my side, please feel free to close the issue.
I think documentation of which regions the tools produce, plus some hints for post-processing to get rid of "unwanted" ones, could help a lot.
Wondering what the right place for that kind of description would be. It seems natural to say something about the detected region types directly in the docstring (which is forwarded by --help). But we will have 3 CLI frontends for that (ocrd-tesserocr-segment-region, ocrd-tesserocr-segment and ocrd-tesserocr-recognize).
Also, describing how to do post-processing (like polygonalising text lines and clipping/shrinking text regions) seems useful, but would have to be included in many places redundantly. And explaining how TextRegion/@type can be mapped programmatically is an altogether different kind of task (not user but developer documentation).
Maybe the README.md would be better suited for all this. @kba, what do you think?
+1, for non-processor-specific documentation, README is the best place.
When using the (recommended) tesserocr-segment-region, we experience quite a lot of floating regions in normal text, which is of course not desirable. They appear rather spontaneously. Are there any settings to minimize their occurrence?