OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

segment-region: add Tesseract's sparse_text mode #121

bertsky closed 4 years ago

bertsky commented 4 years ago

Rationale

Tesseract offers a special page segmentation mode for text scattered arbitrarily across the page, called PSM_SPARSE_TEXT (or PSM_SPARSE). It tries to recover as much text as possible without paying attention to size, order or non-text foreground. The resulting text regions contain no vertical gaps (i.e. each is a single line) and no horizontal gaps wider than a single blank (i.e. no tabs). It is based on Tesseract's internal textline recognition, without any re-partitioning or column/table detection logic. (It does, however, still undergo Tesseract's internal suppression of images and of vertical/horizontal lines.)
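For illustration, a minimal sketch of requesting this mode directly through tesserocr. This is not the code of this PR; the function names `box_to_bbox` and `sparse_text_regions` are hypothetical, and it assumes tesserocr, Pillow and a Tesseract model for `lang` are installed.

```python
def box_to_bbox(box):
    """Convert a tesserocr box dict (keys x, y, w, h) to (x0, y0, x1, y1)."""
    return (box["x"], box["y"], box["x"] + box["w"], box["y"] + box["h"])


def sparse_text_regions(image, lang="eng"):
    """Return bounding boxes of the text blocks found in sparse-text mode.

    `image` is a PIL.Image; `lang` must name an installed Tesseract model.
    (Hypothetical helper for illustration, not part of ocrd_tesserocr.)
    """
    # Imported lazily so that box_to_bbox stays usable without tesserocr.
    from tesserocr import PyTessBaseAPI, PSM, RIL

    with PyTessBaseAPI(lang=lang, psm=PSM.SPARSE_TEXT) as api:
        api.SetImage(image)
        # Second argument is text_only: skip non-text components.
        # Each item is (component image, box dict, block id, paragraph id).
        components = api.GetComponentImages(RIL.BLOCK, True)
        return [box_to_bbox(box) for _, box, _, _ in components]
```

Calling `sparse_text_regions(Image.open(path))` on a page image should then yield one box per detected block; what exactly Tesseract returns depends on the image and the installed version.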

This can be useful in itself for tasks that try to find some text anywhere on pages, but also as an auxiliary step in regular (i.e. full) page segmentation methods. For example, one can combine this with other (perhaps even data-driven) segmentations, or use OCR as an intermediate step in guiding them.

Example:

ocrd-segment-region-sparse-text

(Of course, running this step after text-image segmentation – suppression of non-text – from other modules is worthwhile, too.)
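As a rough sketch of the auxiliary use mentioned above, one possible way to combine the two results is to check which sparse-text boxes are not covered by the regions of another segmentation, i.e. text that the other method may have missed. All names and the `min_coverage` threshold are assumptions made for this example; boxes are `(x0, y0, x1, y1)` tuples.

```python
def intersection_area(a, b):
    """Area of overlap between two (x0, y0, x1, y1) boxes; 0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def area(box):
    """Area of a (x0, y0, x1, y1) box; 0 for degenerate boxes."""
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def uncovered_text(sparse_boxes, regions, min_coverage=0.5):
    """Return the sparse-text boxes that no region covers sufficiently.

    A box counts as covered if at least `min_coverage` of its area
    lies inside a single region of the other segmentation.
    """
    missed = []
    for box in sparse_boxes:
        box_area = area(box)
        if box_area == 0:
            continue  # ignore degenerate boxes
        best = max((intersection_area(box, r) for r in regions), default=0)
        if best / box_area < min_coverage:
            missed.append(box)
    return missed
```

The boxes returned here could be added as extra text regions or used to flag pages for review; which of these is appropriate depends on the workflow.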

codecov[bot] commented 4 years ago

Codecov Report

Merging #121 into master will not change coverage. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #121   +/-   ##
=======================================
  Coverage   37.46%   37.46%           
=======================================
  Files           9        9           
  Lines         953      953           
  Branches      209      209           
=======================================
  Hits          357      357           
  Misses        532      532           
  Partials       64       64           

Impacted Files | Coverage Δ
--- | ---
ocrd_tesserocr/segment_region.py | 56.15% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 7026429...e0b652e.