VikParuchuri / surya

OCR, layout analysis, reading order, table recognition in 90+ languages
https://www.datalab.to
GNU General Public License v3.0
14.39k stars, 902 forks

Proposal: Support generating hOCR output #139

Open hcoona opened 5 months ago

hcoona commented 5 months ago

hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.

Supporting a standard output format would make it easier for other tools to integrate with Surya.

Tesseract OCR already supports hOCR output.
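
To make the request concrete, here is a rough sketch of what line-level hOCR generation could look like. It does not use Surya's API: the `Line` dataclass and the `to_hocr` function are hypothetical names I made up for illustration, and the input is assumed to be a list of text lines with pixel bounding boxes. It only emits `ocr_page` and `ocr_line` elements with `bbox` properties; confidence values and word-level elements are left out.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Line:
    """Hypothetical recognized text line; NOT a Surya type."""
    text: str
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def to_hocr(lines: list[Line], page_size: tuple[int, int]) -> str:
    """Render one page of recognized lines as a minimal hOCR document."""
    width, height = page_size
    spans = []
    for i, line in enumerate(lines, start=1):
        x0, y0, x1, y1 = line.bbox
        # hOCR stores layout data in the title attribute, e.g. "bbox x0 y0 x1 y1".
        spans.append(
            f"   <span class='ocr_line' id='line_{i}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{escape(line.text)}</span>"
        )
    body = "\n".join(spans)
    return (
        "<html>\n"
        " <head>\n"
        "  <meta charset='utf-8'/>\n"
        "  <meta name='ocr-system' content='example'/>\n"
        "  <meta name='ocr-capabilities' content='ocr_page ocr_line'/>\n"
        " </head>\n"
        " <body>\n"
        f"  <div class='ocr_page' id='page_1' title='bbox 0 0 {width} {height}'>\n"
        f"{body}\n"
        "  </div>\n"
        " </body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    demo = [Line("Hello world", (10, 20, 110, 40))]
    print(to_hocr(demo, page_size=(200, 100)))
```

A real implementation would map Surya's own result objects onto these elements and would probably add confidence properties as well, but the overall shape of the output should be similar.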

VikParuchuri commented 5 months ago

I will add this to the list of things to work on, but I'm also happy to take a PR for it.

Backendmagier commented 3 months ago

+1 would love to have this :)