HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

Word mismatch between HTML and PDF for visual linker #12

Open lukehsiao opened 6 years ago

lukehsiao commented 6 years ago

In the test md document we use an ordered HTML list, which renders as numbers in the PDF. This is causing a mismatch of words when doing the visual parse. The word lists extracted from that document are:

HTML PDF
Sample Sample
Markdown Markdown
This This
is is
some some
basic basic
, ,
sample sample
markdown markdown
. .
Second Second
Heading Heading
Unordered Unordered
lists lists
, ,
and and
: :
One 1
Two .
Three One
More 2
Blockquote .
And Two
bold 3
, .
italics Three
, More
and Blockquote
even And
italics bold
and ,
later italics
. ,
Even and
bold even
strikethrough italics
. and
A later
link bold
to .
somewhere Even
. strikethrough
Here .
is A
a link
table to
Name somewhere
Lunch .
order Here
Spicy is
Owes a
Joan table
saag Name
paneer Lunch
medium order
$ Spicy
11 Owes
Sally Joan
vindaloo saag
mild paneer
$ medium
14 $11
Erin Sally
lamb vindaloo
madras mild
HOT $14
$ Erin
5 lamb
Or madras
inline HOT
code $5
like Or
var inline
foo code
= like
' var
bar foo
' =
; 'bar';
. .
Or Or
an an
image image
of of
bears bears
The The
end end
... ...

Notice that in the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML.

Ideally this will be resolved when we switch to pdftotree and the new parser.

HiromuHota commented 5 years ago

Could https://spacy.io/api/goldparse#align be useful for aligning two lists of words?
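
As a minimal sketch of what such an alignment could look like, the example below uses Python's standard `difflib.SequenceMatcher` instead of the spaCy function linked above (a swapped-in technique, not something Fonduer does). The word lists are a short excerpt from the table in this issue, and `align_words` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

# Excerpt of the word lists from the table in this issue.
html_words = ["and", ":", "One", "Two", "Three", "More"]
pdf_words = ["and", ":", "1", ".", "One", "2", ".", "Two", "3", ".", "Three", "More"]


def align_words(html, pdf):
    """Return (html_index, pdf_index) pairs for words matched in order.

    Hypothetical helper: it relies on difflib's matching blocks,
    not on spaCy or on anything inside Fonduer.
    """
    matcher = SequenceMatcher(a=html, b=pdf, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


pairs = align_words(html_words, pdf_words)

# PDF words with no HTML counterpart (here, the list numbering).
matched_pdf = {pdf_index for _, pdf_index in pairs}
pdf_only = [w for i, w in enumerate(pdf_words) if i not in matched_pdf]
print(pairs)
print(pdf_only)
```

With an alignment like this, the extra PDF tokens are treated as insertions instead of shifting every later word by one position.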

senwu commented 5 years ago

Good catch! We will take a look at this function. Thanks!

HiromuHota commented 4 years ago

@lukehsiao What exactly was the issue? I recognize that "the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML." but I couldn't see that "this is causing a mismatch of words when doing the visual parse".

md.xml.txt is the output of pdftotext -bbox-layout md.pdf md.xml.txt. According to this, the Sentence for "One" has the following bbox.

<line xMin="83.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">
   <word xMin="83.000000" yMin="155.556641" xMax="92.000000" yMax="168.845703">1.</word>
   <word xMin="95.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">One</word>
</line>

On the other hand, according to #433, the VisualLinker correctly assigned a bbox to each Sentence, including "One", "Two", and "Three". For example,

    # Check bbox for "One"
    sentence = doc.sentences[4]
    assert sentence.text == "One"
    assert sentence.get_bbox() == Bbox(page=1, left=95, top=155, right=114, bottom=168)

I don't see any problem from the visual linker perspective.
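
To make the comparison concrete, here is a small sketch that reads the `<line>` fragment above with Python's standard `xml.etree.ElementTree` and truncates each word's coordinates to integers. The truncation is an assumption inferred from the numbers in the assertion above, not a confirmed description of how the VisualLinker computes them, and `word_bbox` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The <line> fragment quoted above. The full pdftotext output wraps such
# fragments in an XHTML document, which this sketch does not handle.
LINE_XML = """
<line xMin="83.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">
   <word xMin="83.000000" yMin="155.556641" xMax="92.000000" yMax="168.845703">1.</word>
   <word xMin="95.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">One</word>
</line>
"""


def word_bbox(line_element, text):
    """Return (left, top, right, bottom) for the first word equal to `text`.

    Coordinates are truncated to int; this is an assumption made to
    match the values in the test assertion quoted above.
    """
    for word in line_element.iter("word"):
        if word.text == text:
            keys = ("xMin", "yMin", "xMax", "yMax")
            return tuple(int(float(word.get(key))) for key in keys)
    return None


line = ET.fromstring(LINE_XML.strip())
bbox_one = word_bbox(line, "One")
bbox_number = word_bbox(line, "1.")
print(bbox_one, bbox_number)
```

The box for "One" starts at x=95, after the "1." token that ends at x=92, which is consistent with the list number being excluded from the Sentence's bbox.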

Has the original issue been resolved over the years, or am I mistaken?

lukehsiao commented 4 years ago

Hmmm, interesting. This issue is old enough that I can't recall the specifics, nor do I have an example that demonstrates it. I should've added more specifics when this issue was created. It is definitely possible this issue was resolved over time and just was not closed...

@SenWu, do you have any recollection? If not, I'd suggest we close this as stale and can reopen if we find an issue and have specific examples in the future.

senwu commented 4 years ago

Hi @HiromuHota ,

I think this issue is about word mismatch between HTML and PDF documents. In this case, some words appear in the PDF but not in the HTML (such as 1, 2, and 3), and the order of words is not consistent between the two documents. These two problems cause word matching issues for the visual linker. The reason for these problems is that we use different software to generate different types of documents (PDFs and HTMLs). The ideal case would be using a character-level parser to parse coordinates and document structure instead of existing tools such as Adobe.

Sen

HiromuHota commented 4 years ago

Thank you both for your recollections. I think the first problem (words that appear only in the PDF) is easier to solve, but the second (inconsistent word order) is much harder and may even be a won't-fix. The word order can never be guaranteed to be consistent, because HTML does not specify exactly (at the pixel level) how it is rendered and displayed.

The visual linker is supposed to link words between HTML and PDF. Depending on your use case, the original source could be HTML or PDF as illustrated below.

image

If I could take what @SenWu suggested one step further, I'd suggest embedding the coordinates into the HTML (maybe we should call it XML at that point). As a result,

  1. The Fonduer Parser takes coordinates if they are embedded in the HTML and attaches them to Sentence (and other data models like Cell and Table).
  2. The current visual linker is split out as a utility function that can be used as a preprocessing step before the parser.

With this change, users would be expected to embed coordinates in the HTML on their own (they can use the visual linker, but that's entirely up to them). A bonus is that Fonduer would become conceptually easier to understand, because the Parser treats HTML as the single source of information for all modalities.

FYI, one of my use cases is extracting information from scanned documents, so the source is a PDF (more specifically, a scanned image in a PDF, i.e., a non-searchable PDF). By applying OCR, I make it a searchable PDF. Then I use pdftotext to generate HTML (XML actually, e.g., the md.xml.txt in https://github.com/HazyResearch/fonduer/issues/12#issuecomment-639908945).

image

senwu commented 4 years ago

@HiromuHota, you are exactly right!

My comments were mainly focused on PDF documents as input data. If the input data is HTML, it would be very hard to find the exact words in some cases.

And I agree with your suggested solution of embedding the coordinates into the HTML; it's actually the solution we want to have. It has multiple benefits: it won't rely on Adobe, it supports scanned PDFs nicely, and it gives the Fonduer parser more control.

HiromuHota commented 4 years ago

I've tested pdftotree for the first time and learned that pdftotree tests/data/pdf_simple/md.pdf gives me

<html><div id=1>
<header
    char='S a m p l e   M a r k d o w n ',
    top='37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 ',
    left='35.0 48.34765616 60.34765616 80.3398436 93.68749976000001 100.35546848 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.3398436 ',
    bottom='61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 ', 
    right='48.34765616 60.34765616 80.3398436 93.68749976 100.35546848000001 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.33984359999997 231.68749975999998 '
>Sample Markdown </header>

I think we can just take this approach, which embeds coordinates (top, left, bottom, right) in the corresponding HTML element, and let Fonduer parse it. I wonder if you have ever ingested this style of HTML into Fonduer... In my own experience, I create a similar format of HTML ("HTML-like") and let Fonduer parse it with a custom VisualLinker of my own.

Let's make this approach a first-class citizen in Fonduer, since we have already been doing it as a custom step.
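
As a rough sketch of what a parser could do with the per-character coordinates in the pdftotree output above, the snippet below merges the space-separated `top`, `left`, `bottom`, and `right` values into one box for the element. `Box` and `element_box` are hypothetical names, not part of Fonduer or pdftotree, and the input is limited to the first six characters ("Sample") of the `<header>` element.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box; not Fonduer's own Bbox class."""

    left: float
    top: float
    right: float
    bottom: float


def element_box(top, left, bottom, right):
    """Merge space-separated per-character coordinates into one box."""
    tops = [float(v) for v in top.split()]
    lefts = [float(v) for v in left.split()]
    bottoms = [float(v) for v in bottom.split()]
    rights = [float(v) for v in right.split()]
    if not (len(tops) == len(lefts) == len(bottoms) == len(rights)):
        raise ValueError("coordinate lists must have the same length")
    return Box(min(lefts), min(tops), max(rights), max(bottoms))


# Values copied from the first six characters of the <header> above.
box = element_box(
    top="37.94140616000004 " * 6,
    left="35.0 48.34765616 60.34765616 80.3398436 93.68749976000001 100.35546848 ",
    bottom="61.94140616000004 " * 6,
    right="48.34765616 60.34765616 80.3398436 93.68749976 100.35546848000001 111.00781232 ",
)
print(box)
```

A word-level or sentence-level box could be derived the same way by slicing the coordinate lists to the characters of interest before taking the minimum and maximum.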

HiromuHota commented 4 years ago

According to https://documents.icar-us.eu/documents/2016/12/report-on-file-formats-for-hand-written-text-recognition-htr-material.pdf, there are 4 major file formats for OCR:

  1. PAGE XML
  2. ALTO XML
  3. ABBYY FineReader XML
  4. hOCR

I'd propose to support hOCR because:

  1. The first three formats are standalone XML formats. hOCR, on the other hand, is also XML-based but is embedded in HTML/XHTML documents. Because of this, Fonduer can support hOCR just by extending the current parser.py.
  2. Tesseract, the most popular open-source OCR engine, supports hOCR as an output format. This means that Fonduer will be able to parse Tesseract's output without any conversion.
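
To illustrate why hOCR fits an HTML parser, here is a minimal sketch that pulls word boxes out of an hOCR fragment using only Python's standard `html.parser`. In hOCR, each word is typically a `span` with class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` property. The sample markup is hand-written for illustration (not real Tesseract output), and `HocrWordParser` and `parse_bbox` are hypothetical names, not part of Fonduer.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 83 155 114 168'>
<span class='ocrx_word' title='bbox 83 155 92 168; x_wconf 95'>1.</span>
<span class='ocrx_word' title='bbox 95 155 114 168; x_wconf 96'>One</span>
</span>
"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocrx_word" in classes:
            self._bbox = parse_bbox(attributes.get("title") or "")

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


parser = HocrWordParser()
parser.feed(SAMPLE_HOCR)
parser.close()
words = parser.words
print(words)
```

Because the coordinates live in ordinary HTML attributes, the text and its boxes arrive in a single pass, with no separate PDF-to-HTML word matching step.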

I'll open a separate feature request that will eventually resolve this issue.

senwu commented 4 years ago

Awesome! I think your suggestion (hOCR) is super great!