HiromuHota closed this issue 3 years ago
In the FineReader XML to hOCR converter, they were discussing how to handle tables (<table />) in hOCR.
https://github.com/cneud/ocr-conversion is a collection of scripts and stylesheets for conversion between various OCR formats.
As the example below from https://en.wikipedia.org/wiki/HOCR illustrates, an hOCR file typically uses a <span /> element for each word:
<p class='ocr_par' lang='deu' title="bbox930">
<span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
<span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
<span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
<span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span>
<span class='ocrx_word' title='bbox 773 803 802 831; x_wconf 96'>in</span>
<span class='ocrx_word' title='bbox 821 803 917 830; x_wconf 96'>ihrem</span>
<span class='ocrx_word' title='bbox 935 799 1180 838; x_wconf 95'>ursprünglichen</span>
<span class='ocrx_word' title='bbox 1199 797 1343 832; x_wconf 95'>Umfange</span>
<span class='ocrx_word' title='bbox 1362 805 1399 823; x_wconf 95'>zu</span>
<span class='ocrx_word' title='bbox 1417 x_wconf 96'>ver-</span>
</span>
...
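To make the title attribute in the example above more concrete, here is a minimal sketch of how it could be split into properties. The helper name parse_hocr_title is hypothetical (it is not part of Fonduer or any hOCR library), and the sketch assumes the usual "name value value ...; name value ..." layout shown above.

```python
def _to_number(token: str):
    """Convert a token to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            continue
    return token


def parse_hocr_title(title: str) -> dict:
    """Split an hOCR title attribute into a {property: [values]} mapping.

    Hypothetical helper for illustration only.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if not parts:
            continue  # skip empty segments, e.g. after a trailing ';'
        props[parts[0]] = [_to_number(p) for p in parts[1:]]
    return props


word = parse_hocr_title("bbox 348 805 402 832; x_wconf 93")
print(word["bbox"])     # [348, 805, 402, 832]
print(word["x_wconf"])  # [93]
```

A truncated title such as the one on the last word above would simply yield fewer bbox values, so a caller would still need to validate the length.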
Meanwhile, fonduer.parser.Parser flattens <span /> elements by default (though this is configurable via the flatten argument).
What I would propose is illustrated by the diagram below.
Here are the things to be developed:
HOCRDocPreprocessor (new)
VisualLinker (changed)
HOCRPreprocessor is for hOCR files. It flattens span.ocrx_word and span.ocr_line but preserves the title attribute of each span.ocrx_word, so that p.ocr_par has a text node as a direct child.
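The flattening step could look roughly like the sketch below. It uses only the standard library and is not the actual preprocessor. The function name flatten_paragraph and the data-word-titles attribute are assumptions made for illustration, since the proposal does not say where the preserved title values should be stored.

```python
import xml.etree.ElementTree as ET

HOCR = """<p class='ocr_par' lang='deu'>
 <span class='ocr_line' title='bbox 348 797 1482 838; baseline -0.009 -6'>
  <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
  <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
 </span>
</p>"""


def flatten_paragraph(par: ET.Element) -> ET.Element:
    """Flatten ocr_line/ocrx_word spans so the paragraph holds plain text.

    Word titles are kept in a hypothetical 'data-word-titles' attribute,
    one entry per word, separated by '|'.
    """
    words = [el for el in par.iter("span") if el.get("class") == "ocrx_word"]
    texts = [(w.text or "").strip() for w in words]
    titles = [w.get("title", "") for w in words]

    # Drop the nested spans; the paragraph keeps only a direct text node.
    for child in list(par):
        par.remove(child)
    par.text = " ".join(texts)
    par.set("data-word-titles", "|".join(titles))
    return par


par = flatten_paragraph(ET.fromstring(HOCR))
print(ET.tostring(par, encoding="unicode"))
```

Keeping one title per word in the same order as the text means the visual information can still be aligned with tokens after flattening, provided tokenization splits on the same whitespace.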
Make VisualLinker a stand-alone tool that reads visual information from the PDF file and embeds it in the HTML file in the hOCR format.
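As a rough illustration of the embedding half of that idea, the sketch below turns words and bounding boxes into an hOCR line. It assumes the boxes have already been extracted from the PDF by some other means. The function words_to_hocr_line and its input shape are hypothetical and do not reflect the existing VisualLinker API.

```python
from html import escape
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def words_to_hocr_line(words: List[Tuple[str, BBox]]) -> str:
    """Render (text, bbox) pairs as one hOCR ocr_line span.

    Hypothetical helper: the line bbox is the union of the word boxes.
    """
    if not words:
        raise ValueError("at least one word is required")

    spans = []
    for text, (x0, y0, x1, y1) in words:
        spans.append(
            f"<span class='ocrx_word' title='bbox {x0} {y0} {x1} {y1}'>"
            f"{escape(text)}</span>"
        )

    # Union of all word boxes gives the enclosing line box.
    lx0 = min(b[0] for _, b in words)
    ly0 = min(b[1] for _, b in words)
    lx1 = max(b[2] for _, b in words)
    ly1 = max(b[3] for _, b in words)
    return (
        f"<span class='ocr_line' title='bbox {lx0} {ly0} {lx1} {ly1}'>"
        + " ".join(spans)
        + "</span>"
    )


line = words_to_hocr_line([
    ("Die", (348, 805, 402, 832)),
    ("Darlehenssumme", (421, 804, 697, 832)),
])
print(line)
```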
An HTML file could still be consumed by HTMLDocPreprocessor without any visual information. In this case, a new Visualizer might be required to visualize a candidate, or an alternative debugger would be needed.
Fixed by #519
Description of the feature request
Is your feature request related to a problem? Please describe.
One of my use cases is to extract information from scanned documents. Tesseract, the most popular open-source OCR, supports hOCR as an output format. I'd like Fonduer to support hOCR (aka html OCR) as an input format.
Description of the solution you'd like
Description of the alternatives you've considered
According to https://documents.icar-us.eu/documents/2016/12/report-on-file-formats-for-hand-written-text-recognition-htr-material.pdf, there are 4 major file formats for OCR:
The first three formats are XML-based. hOCR is also XML-based, but it is embedded in HTML/XHTML documents. Because of this characteristic, Fonduer can support hOCR by just extending the current parser.py.
Additional context
I find it hard for beginners to understand the concept of a "visual linker". Fonduer uses four modalities of information: textual, structural, tabular, and visual. The first three are extracted from HTML files, but the visual modality comes from the corresponding PDF files. I think this difference introduces friction for beginners.
With this requested feature, Fonduer will assume that an HTML file contains information for all four modalities when parsing. This would be easier to understand and would match what users expect from the term "parsing".