OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Clipped page image #177

Closed by bertsky 3 years ago

bertsky commented 3 years ago

During layout analysis on the page level, Tesseract internally detects images and separators. Their bounding boxes can be queried via iterators, but the precise polygons are not available, and in sparse mode, only text blocks are returned. Moreover, we usually do not suppress separators and images in the foreground of the images that consumers see, even if we do annotate them in PAGE. This often yields suboptimal results, especially for OCR engines like Ocropy or Calamari.

This PR therefore queries Tesseract's internal binarized and nontext-suppressed (i.e. clipped) page-level image and annotates it as a page-level derived image (with the additional features binarized,clipped). As long as consumers do not opt out of the usual AlternativeImage retrieval priority, they will get images without non-textual foreground.
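To illustrate what "retrieval priority" and "opting out" mean here, the following is a minimal toy model, not the actual OCR-D core code. It assumes that a consumer takes the most recently annotated AlternativeImage whose comma-separated features match its request. The class, the function `pick_image` and the file names are hypothetical; the parameter names `feature_selector` and `feature_filter` are borrowed from OCR-D's image retrieval API, but their semantics are only approximated.

```python
from dataclasses import dataclass


@dataclass
class AlternativeImage:
    filename: str  # hypothetical path, for illustration only
    comments: str  # comma-separated features, as in PAGE's @comments


def pick_image(alternatives, feature_selector="", feature_filter=""):
    """Return the latest image with all selected and none of the filtered features."""
    wanted = {f for f in feature_selector.split(",") if f}
    unwanted = {f for f in feature_filter.split(",") if f}
    # Later annotations take priority, so search from the end.
    for alt in reversed(alternatives):
        features = {f for f in alt.comments.split(",") if f}
        if wanted <= features and not (unwanted & features):
            return alt
    return None  # a real consumer would fall back to the original image


page_images = [
    AlternativeImage("OCR-D-BIN/page_0001.png", "binarized"),
    # The kind of image this PR adds: binarized, with non-text suppressed.
    AlternativeImage("OCR-D-SEG/page_0001.png", "binarized,clipped"),
]

default_choice = pick_image(page_images)
opt_out_choice = pick_image(page_images, feature_filter="clipped")
```

In this model, a consumer with default settings ends up with the clipped image, while one that filters out `clipped` falls back to the plain binarized image.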

codecov[bot] commented 3 years ago

Codecov Report

Merging #177 (ac4c81b) into master (b755b26) will decrease coverage by 0.06%. The diff coverage is 16.66%.


@@            Coverage Diff             @@
##           master     #177      +/-   ##
==========================================
- Coverage   31.45%   31.38%   -0.07%     
==========================================
  Files          12       12              
  Lines        1240     1252      +12     
  Branches      287      289       +2     
==========================================
+ Hits          390      393       +3     
- Misses        775      784       +9     
  Partials       75       75              
| Impacted Files | Coverage Δ |
|---|---|
| ocrd_tesserocr/binarize.py | 18.57% <0.00%> (-2.75%) :arrow_down: |
| ocrd_tesserocr/recognize.py | 30.72% <100.00%> (+0.29%) :arrow_up: |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update b755b26...ac4c81b.

bertsky commented 3 years ago

> IIUC this will prevent detection of false positives (regions w/o text) downstream, so LGTM.

No, it's not about non-text regions. It is about non-textual (image/separator) foreground parts in text segments across the hierarchy, from text regions down to glyphs (depending on where the images are consumed and whether the consumers use the clipped page-level AlternativeImage annotated here). And it only applies when using ocrd-tesserocr-segment*/recognize with segmentation_level=region (i.e. as top level segmenter).

For example, if you run tesserocr-segment followed by calamari-recognize, this PR ensures that Calamari gets to see textline images without intruding h/v-lines (at least as far as Tesseract detects them correctly). Or if you use tesserocr-segment-region followed by ocropy-segment, the line segmenter should see clean text blocks.
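The effect on a downstream line image can be sketched with a toy example. This is not how Tesseract or ocrd_tesserocr implement clipping: it assumes a tiny binary page given as nested lists (1 = foreground) and clears a rectangular separator box, whereas Tesseract suppresses non-text internally with its own masks. The helper names `clip_nontext` and `crop` are hypothetical.

```python
def clip_nontext(binary, boxes, background=0):
    """Return a copy of `binary` with all pixels inside the given boxes cleared."""
    clipped = [row[:] for row in binary]
    for x0, y0, x1, y1 in boxes:  # half-open boxes: x0 <= x < x1, y0 <= y < y1
        for y in range(y0, y1):
            for x in range(x0, x1):
                clipped[y][x] = background
    return clipped


def crop(image, box):
    """Cut the half-open box (x0, y0, x1, y1) out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Toy page: some "text" pixels in columns 1-2, a vertical rule in column 4.
page = [
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
]
separator = (4, 0, 5, 3)  # bounding box of the vertical rule
line_box = (0, 0, 6, 3)   # text line box that overlaps the rule

raw_line = crop(page, line_box)                               # rule intrudes
clean_line = crop(clip_nontext(page, [separator]), line_box)  # rule removed
```

A recognizer fed `raw_line` would see the rule as part of the line, while `clean_line` keeps only the text pixels, which is the situation the clipped page-level image is meant to create for consumers.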