Closed by bertsky 4 years ago
Merging #121 into master will not change coverage. The diff coverage is 100.00%.
Coverage Diff | master | #121 | +/- |
---|---|---|---|
Coverage | 37.46% | 37.46% | |
Files | 9 | 9 | |
Lines | 953 | 953 | |
Branches | 209 | 209 | |
Hits | 357 | 357 | |
Misses | 532 | 532 | |
Partials | 64 | 64 | |

Impacted Files | Coverage Δ | |
---|---|---|
ocrd_tesserocr/segment_region.py | 56.15% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Rationale
Tesseract offers a special page segmentation mode for text scattered arbitrarily across the page, called `PSM_SPARSE_TEXT` (or `PSM_SPARSE`). It tries to recover as much text as possible without paying attention to size, order or non-text foreground, delivering text regions that contain no vertical spaces (i.e. only single lines) and no horizontal spaces wider than a single blank (i.e. no tabs). It is based on Tesseract's internal textline recognition, without any re-partitioning or column/table detection logic. (But it does undergo internal image and vertical/horizontal line suppression.)

This can be useful in itself for tasks that try to find some text anywhere on pages, but also as an auxiliary step in regular (i.e. full) page segmentation methods. For example, one can combine this with other (perhaps even data-driven) segmentations, or use OCR as an intermediate step in guiding them.
Example:
(Of course, running this step after text-image segmentation – suppression of non-text – from other modules is worthwhile, too.)
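To make the "auxiliary step" idea above more concrete, here is a minimal sketch, not the code of this PR. It assumes the `tesserocr` Python bindings are installed. `sparse_text_boxes` asks Tesseract for block-level boxes in sparse-text mode. `uncovered_boxes` is a hypothetical helper that keeps only those boxes which an existing segmentation does not already cover. The function names, the `(x0, y0, x1, y1)` box format and the `min_cover` threshold are illustrative assumptions.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def sparse_text_boxes(image_path: str) -> List[Box]:
    """Return bounding boxes of the blocks Tesseract finds in sparse-text mode."""
    # Imported lazily so the pure-Python helpers below work without tesserocr.
    from tesserocr import PSM, RIL, PyTessBaseAPI

    with PyTessBaseAPI(psm=PSM.SPARSE_TEXT) as api:
        api.SetImageFile(image_path)
        components = api.GetComponentImages(RIL.BLOCK, True)
    # Each component is (image, box, block_id, paragraph_id);
    # box is a dict with keys x, y, w, h.
    return [(b["x"], b["y"], b["x"] + b["w"], b["y"] + b["h"])
            for _, b, _, _ in components]


def covered_fraction(box: Box, region: Box) -> float:
    """Fraction of the area of `box` that lies inside `region`."""
    x0, y0 = max(box[0], region[0]), max(box[1], region[1])
    x1, y1 = min(box[2], region[2]), min(box[3], region[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area else 0.0


def uncovered_boxes(sparse: List[Box], regions: List[Box],
                    min_cover: float = 0.5) -> List[Box]:
    """Keep sparse-text boxes that no region covers by at least `min_cover`."""
    return [b for b in sparse
            if all(covered_fraction(b, r) < min_cover for r in regions)]
```

With regions from another segmenter in hand, `uncovered_boxes(sparse_text_boxes("page.png"), regions)` would list candidate text that the regular segmentation missed; `page.png` is a placeholder path.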