**Closed** — bertsky closed this pull request 3 years ago
Merging #177 (ac4c81b) into master (b755b26) will decrease coverage by 0.06%. The diff coverage is 16.66%.
```diff
@@            Coverage Diff             @@
##           master     #177      +/-   ##
==========================================
- Coverage   31.45%   31.38%   -0.07%
==========================================
  Files          12       12
  Lines        1240     1252      +12
  Branches      287      289       +2
==========================================
+ Hits          390      393       +3
- Misses        775      784       +9
  Partials       75       75
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/binarize.py | 18.57% <0.00%> (-2.75%) | :arrow_down: |
| ocrd_tesserocr/recognize.py | 30.72% <100.00%> (+0.29%) | :arrow_up: |
IIUC this will prevent detection of false positives (regions w/o text) downstream, so LGTM.
No, it's not about non-text regions. It is about non-textual (image/separator) foreground parts within text segments across the hierarchy, from text regions down to glyphs (depending on where the images are consumed, and whether the consumers use the clipped page-level AlternativeImage annotated here). And it only applies when using ocrd-tesserocr-segment*/recognize with `segmentation_level=region` (i.e. as the top-level segmenter).
For example, if you run tesserocr-segment followed by calamari-recognize, this PR ensures that Calamari gets to see textline images without intruding h/v-lines (at least as far as Tesseract detects them correctly). Or if you use tesserocr-segment-region followed by ocropy-segment, the line segmenter should see clean text blocks.
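To make the first scenario concrete, here is a hypothetical OCR-D workflow sketch (file-group names are placeholders, and both processor packages would need to be installed; this is an illustration, not a tested invocation from this PR):

```sh
# Segment the page with Tesseract, then recognize the (now clipped)
# textline images with Calamari:
ocrd process \
  "tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG" \
  "calamari-recognize -I OCR-D-SEG -O OCR-D-OCR"
```

With this PR, the derived images that `calamari-recognize` retrieves for each textline should no longer contain separator lines that Tesseract detected during page segmentation.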
During layout analysis on the page level, Tesseract internally detects images and separators. Their bounding boxes can be queried via iterators, but the precise polygons are not available, and in sparse mode, only text blocks are returned. Moreover, we usually do not suppress separators and images in the foreground seen by consumers, even if we do annotate them in PAGE. This often yields suboptimal results, especially for OCR engines like Ocropy or Calamari.
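The iterator limitation can be illustrated with a small tesserocr sketch (assuming tesserocr and its models are installed; the input filename is hypothetical):

```python
# Sketch: query Tesseract's internal layout analysis via tesserocr.
# Only axis-aligned bounding boxes are exposed through the iterator API;
# the precise polygon outlines of images/separators are not available.
try:
    from tesserocr import PyTessBaseAPI, RIL, iterate_level
except ImportError:
    PyTessBaseAPI = None  # tesserocr not available in this environment

def layout_blocks(image_path):
    """Yield (block_type, bounding_box) for each detected layout block."""
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        it = api.AnalyseLayout()  # layout analysis only, no recognition
        for block in iterate_level(it, RIL.BLOCK):
            # BoundingBox yields (x1, y1, x2, y2) -- no polygon outline
            yield block.BlockType(), block.BoundingBox(RIL.BLOCK)

if PyTessBaseAPI is not None:
    for kind, box in layout_blocks("page.png"):  # hypothetical input file
        print(kind, box)
```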
This PR therefore queries the internal binarized and nontext-suppressed (i.e. clipped) page-level image and annotates it as a page-level derived image (with the additional features `binarized,clipped`). As long as consumers do not opt out of the usual AlternativeImage retrieval priority, they will therefore get images without non-textual foreground.
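The retrieval priority can be sketched in pure Python. This is a simplified illustration of the feature-selector/feature-filter semantics that OCR-D consumers use when picking an AlternativeImage, not the actual core implementation (function and variable names are made up):

```python
# Simplified sketch of feature-based AlternativeImage selection, loosely
# modelled on OCR-D's feature_selector/feature_filter semantics.
def pick_alternative_image(images, feature_selector="", feature_filter=""):
    """Return the last annotated image whose comma-separated features
    contain all of feature_selector and none of feature_filter."""
    wanted = set(filter(None, feature_selector.split(",")))
    unwanted = set(filter(None, feature_filter.split(",")))
    best = None
    for filename, features in images:  # in annotation order
        present = set(filter(None, features.split(",")))
        if wanted <= present and not (unwanted & present):
            best = filename  # later (more derived) images take priority
    return best

# Page-level AlternativeImages as (filename, @comments) pairs,
# in the order they were annotated (file names are hypothetical):
images = [
    ("OCR-D-IMG/page1.png", ""),
    ("OCR-D-BIN/page1.png", "binarized"),
    ("OCR-D-SEG/page1.png", "binarized,clipped"),  # added by this PR
]

# A consumer that merely requires a binarized image now also gets the
# clipping for free:
print(pick_alternative_image(images, feature_selector="binarized"))
# → OCR-D-SEG/page1.png

# Opting out is still possible via an explicit filter:
print(pick_alternative_image(images, feature_filter="clipped"))
# → OCR-D-BIN/page1.png
```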