HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

Support hOCR #476

Closed HiromuHota closed 3 years ago

HiromuHota commented 4 years ago

Description of the feature request

Is your feature request related to a problem? Please describe.

One of my use cases is to extract information from scanned documents. Tesseract, the most popular open-source OCR, supports hOCR as an output format. I'd like Fonduer to support hOCR (aka html OCR) as an input format.

Description of the solution you'd like

  1. I'd like Fonduer to use the "visual linker" during pre-processing only when needed (e.g., when plain HTML is used).
  2. I'd like Fonduer to extract visual information embedded in HTML when parsing.

Description of the alternatives you've considered

According to https://documents.icar-us.eu/documents/2016/12/report-on-file-formats-for-hand-written-text-recognition-htr-material.pdf, there are 4 major file formats for OCR:

  1. PAGE XML
  2. ALTO XML
  3. ABBYY FineReader XML
  4. hOCR

The first three are standalone XML formats. hOCR, by contrast, embeds its OCR markup in HTML/XHTML documents. Because of this characteristic, Fonduer can support hOCR by just extending the current parser.py.

Additional context

I find it hard for beginners to understand the concept of the "visual linker". Fonduer uses four modalities of information: textual, structural, tabular, and visual. The first three are extracted from HTML files, but the visual modality comes from the corresponding PDF files. I think this difference creates friction for beginners.

With this requested feature, Fonduer will assume that an HTML file contains all four modalities when parsing. This would be easier to understand and would match what users expect from the term "parsing".

HiromuHota commented 4 years ago

In the FineReader XML to hOCR converter project, there was a discussion about how to handle tables (<table />) in hOCR.

HiromuHota commented 4 years ago

https://github.com/cneud/ocr-conversion is a collection of scripts and stylesheets for conversion between various OCR formats.

HiromuHota commented 4 years ago

[Image: granularities in different OCR formats]

HiromuHota commented 4 years ago

As illustrated in https://en.wikipedia.org/wiki/HOCR and reproduced below, an hOCR document typically uses <span /> elements for words:

<p class='ocr_par' lang='deu' title="bbox930">
  <span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
    <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span> 
    <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span> 
    <span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span> 
    <span class='ocrx_word' title='bbox 773 803 802 831; x_wconf 96'>in</span> 
    <span class='ocrx_word' title='bbox 821 803 917 830; x_wconf 96'>ihrem</span> 
    <span class='ocrx_word' title='bbox 935 799 1180 838; x_wconf 95'>ursprünglichen</span> 
    <span class='ocrx_word' title='bbox 1199 797 1343 832; x_wconf 95'>Umfange</span> 
    <span class='ocrx_word' title='bbox 1362 805 1399 823; x_wconf 95'>zu</span> 
    <span class='ocrx_word' title='bbox 1417 x_wconf 96'>ver-</span> 
  </span>
  ...

Meanwhile, fonduer.parser.Parser flattens <span /> elements by default (this is configurable via the flatten argument).
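To make the visual information in the example above concrete, here is a minimal sketch of reading the bounding box out of an hOCR `title` attribute. The function names `parse_hocr_title` and `bbox_of` are hypothetical and are not part of Fonduer. The sketch assumes properties are separated by `;` and that values contain no quoted strings with embedded semicolons.

```python
from typing import Dict, List, Optional, Tuple


def parse_hocr_title(title: str) -> Dict[str, List[str]]:
    """Split an hOCR title attribute into {property name: value tokens}.

    Assumption: properties are separated by ';' and no value contains
    a quoted string with an embedded semicolon.
    """
    props: Dict[str, List[str]] = {}
    for clause in title.split(";"):
        tokens = clause.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def bbox_of(title: str) -> Optional[Tuple[int, ...]]:
    """Return (x0, y0, x1, y1), or None if bbox is missing or malformed."""
    values = parse_hocr_title(title).get("bbox", [])
    if len(values) != 4:
        return None
    try:
        return tuple(int(v) for v in values)
    except ValueError:
        return None


print(bbox_of("bbox 348 805 402 832; x_wconf 93"))
```

Returning `None` for an incomplete `bbox` keeps a single malformed word from aborting the parse of a whole document; whether to skip or reject such words is a design choice left open here.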

HiromuHota commented 4 years ago

What I propose is illustrated in the diagram below. [image]

Here are things to be developed:

  1. HOCRPreprocessor handles hOCR files. It flattens span.ocrx_word and span.ocr_line so that p.ocr_par has a text node as a direct child, but preserves the title attribute of each span.ocrx_word.
  2. Make VisualLinker a stand-alone tool that reads visual information from the PDF file and embeds it in the HTML file in the hOCR format.
  3. An HTML file could be consumed by HTMLDocPreprocessor without any visual information. In this case, a new Visualizer or an alternative debugger might be required to visualize a candidate.
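The flattening step proposed for HOCRPreprocessor could look roughly like the sketch below. It is not Fonduer's implementation: `WordCollector` and `flatten_hocr` are hypothetical names, and the sketch uses only the standard library's `html.parser`. It assumes that each span.ocrx_word contains plain text with no nested span elements.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class WordCollector(HTMLParser):
    """Collect (word, title) pairs for every span.ocrx_word, per p.ocr_par."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[List[Tuple[str, str]]] = []
        self._title: Optional[str] = None  # title of the word being read
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "p" and "ocr_par" in classes:
            self.paragraphs.append([])
        elif tag == "span" and "ocrx_word" in classes and self.paragraphs:
            self._title = attr.get("title") or ""
            self._chunks = []

    def handle_data(self, data):
        # Text outside a word span (e.g. whitespace between words) is ignored.
        if self._title is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            word = "".join(self._chunks).strip()
            self.paragraphs[-1].append((word, self._title))
            self._title = None


def flatten_hocr(hocr: str) -> List[Tuple[str, List[str]]]:
    """Return one (flattened text, per-word titles) pair per paragraph."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return [
        (" ".join(w for w, _ in par), [t for _, t in par])
        for par in collector.paragraphs
    ]
```

Each paragraph becomes a single text string, matching the "text node as a direct child" goal, while the word-level title values travel alongside it so that bounding boxes can still be recovered per word.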

HiromuHota commented 3 years ago

Fixed by #519