Word mismatch between HTML and PDF for visual linker

lukehsiao commented 6 years ago

In the test md document we use an ordered HTML list, which renders as numbers in the PDF. This causes a mismatch of words when doing the visual parse. The words extracted from that document are:

HTML	PDF
Sample	Sample
Markdown	Markdown
This	This
is	is
some	some
basic	basic
,	,
sample	sample
markdown	markdown
.	.
Second	Second
Heading	Heading
Unordered	Unordered
lists	lists
,	,
and	and
:	:
One	1
Two	.
Three	One
More	2
Blockquote	.
And	Two
bold	3
,	.
italics	Three
,	More
and	Blockquote
even	And
italics	bold
and	,
later	italics
.	,
Even	and
bold	even
strikethrough	italics
.	and
A	later
link	bold
to	.
somewhere	Even
.	strikethrough
Here	.
is	A
a	link
table	to
Name	somewhere
Lunch	.
order	Here
Spicy	is
Owes	a
Joan	table
saag	Name
paneer	Lunch
medium	order
$	Spicy
11	Owes
Sally	Joan
vindaloo	saag
mild	paneer
$	medium
14	$11
Erin	Sally
lamb	vindaloo
madras	mild
HOT	$14
$	Erin
5	lamb
Or	madras
inline	HOT
code	$5
like	Or
var	inline
foo	code
=	like
'	var
bar	foo
'	=
;	'bar';
.	.
Or	Or
an	an
image	image
of	of
bears	bears
The	The
end	end
...	...

Notice that in the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML.

Ideally this will be resolved when we switch to pdftotree and the new parser.

HiromuHota commented 5 years ago

Could https://spacy.io/api/goldparse#align be useful for aligning the two lists of words?
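
As a rough illustration of what aligning the two word lists could look like, here is a minimal sketch that uses Python's standard-library difflib.SequenceMatcher instead of the spaCy function linked above. The word lists are a short excerpt from the table in the first comment, and the helper name align_words is made up for this example; it is not part of Fonduer.

```python
from difflib import SequenceMatcher

# Excerpt of the two word lists from the table above (around the ordered list).
html_words = ["and", ":", "One", "Two", "Three", "More"]
pdf_words = ["and", ":", "1", ".", "One", "2", ".", "Two", "3", ".", "Three", "More"]


def align_words(html, pdf):
    """Return (html_index -> pdf_index mapping, PDF words with no HTML match).

    Hypothetical helper: a plain sequence alignment, not Fonduer's algorithm.
    """
    matcher = SequenceMatcher(a=html, b=pdf, autojunk=False)
    mapping = {}
    pdf_only = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching runs have equal length, so pair them index by index.
            for offset in range(i2 - i1):
                mapping[i1 + offset] = j1 + offset
        elif tag in ("insert", "replace"):
            # Words present in the PDF side that found no HTML counterpart.
            pdf_only.extend(pdf[j1:j2])
    return mapping, pdf_only


mapping, pdf_only = align_words(html_words, pdf_words)
```

With this excerpt, the list markers ("1", ".", and so on) end up in pdf_only and the remaining words map one to one, so the extra PDF tokens no longer shift every later word by an offset.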

senwu commented 5 years ago

Good catch! We will take a look at this function. Thanks!

HiromuHota commented 4 years ago

@lukehsiao What exactly was the issue? I can see that in "the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML", but I can't see how "this is causing a mismatch of words when doing the visual parse".

md.xml.txt is the output of pdftotext -bbox-layout md.pdf md.xml.txt. According to this, the Sentence for "One" has the following bbox.

<line xMin="83.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">
   <word xMin="83.000000" yMin="155.556641" xMax="92.000000" yMax="168.845703">1.</word>
   <word xMin="95.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">One</word>
</line>

On the other hand, according to #433, the VisualLinker correctly assigned bbox to each Sentence including "One", "Two", "Three". For example,

    # Check bbox for "One"
    sentence = doc.sentences[4]
    assert sentence.text == "One"
    assert sentence.get_bbox() == Bbox(page=1, left=95, top=155, right=114, bottom=168)

I don't see any problem from the visual linker perspective.
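
To make the comparison concrete, here is a small sketch that reads the line element quoted above and truncates each word's coordinates to integers. Truncating with floor is my assumption about how the integer values in the assertion relate to the pdftotext output, and the helper name word_boxes is made up for this example.

```python
import math
import xml.etree.ElementTree as ET

# The <line> element quoted above from md.xml.txt.
LINE_XML = """
<line xMin="83.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">
   <word xMin="83.000000" yMin="155.556641" xMax="92.000000" yMax="168.845703">1.</word>
   <word xMin="95.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">One</word>
</line>
"""


def word_boxes(line_xml):
    """Return [(text, (left, top, right, bottom)), ...] for each <word>."""
    line = ET.fromstring(line_xml.strip())
    boxes = []
    for word in line.iter("word"):
        # Assumption: integer coordinates are obtained by truncation.
        coords = tuple(
            math.floor(float(word.get(key)))
            for key in ("xMin", "yMin", "xMax", "yMax")
        )
        boxes.append((word.text, coords))
    return boxes


boxes = word_boxes(LINE_XML)
```

Under that assumption, the box for "One" comes out as left=95, top=155, right=114, bottom=168, which matches the assertion, and the "1." marker gets its own separate box starting at left=83.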

Has the original issue been resolved over the years? Or am I mistaken?

lukehsiao commented 4 years ago

Hmmm, interesting. This issue is old enough that I can't recall the specifics, nor do I have an example that demonstrates it. I should've added more specifics when this issue was created. It is definitely possible this issue was resolved over time and just was not closed...

@SenWu, do you have any recollection? If not, I'd suggest we close this as stale and can reopen if we find an issue and have specific examples in the future.

senwu commented 4 years ago

Hi @HiromuHota ,

I think this issue is about word mismatch between HTML and PDF documents. In this case, some words (such as 1, 2, and 3) appear in the PDF but not in the HTML, and the order of words is not consistent between the two documents. These two problems cause word matching issues for the visual linker. The reason for these problems is that we use different software to generate the different types of documents (PDFs and HTMLs). The ideal case would be to use a character-level parser to parse coordinates and document structure instead of existing tools such as Adobe.

Sen

HiromuHota commented 4 years ago

Thank you both for your recollections. I think the 1st problem is easier to solve, but the 2nd one is much harder and may even be a "won't fix". The two would never be guaranteed to be consistent, as HTML does not convey exactly how (at the pixel level) it is rendered and displayed.

The visual linker is supposed to link words between HTML and PDF. Depending on your use case, the original source could be either HTML or PDF.

To take what @SenWu suggested one step further, I'd suggest embedding the coordinates into the HTML (maybe we should call it XML at that point). As a result,

The Fonduer Parser takes coordinates if they are embedded in the HTML and attaches them to Sentence (and other data models like Cell and Table).
The current visual linker is split out as a utility function that can be used as a preprocessing step before the parser.

With this change, users would be expected to embed coordinates in the HTML on their own (they can use the visual linker, but it's totally up to them). A bonus is that Fonduer would become conceptually easier to understand, because the Parser treats HTML as a single source of all modal information.

FYI, one of my use cases is to extract information from scanned documents. So the source is a PDF (more specifically, a scanned image in a PDF, i.e. a non-searchable PDF). By applying OCR, I make a searchable PDF. Then I use pdftotext to generate HTML (XML actually, e.g., the md.xml.txt in https://github.com/HazyResearch/fonduer/issues/12#issuecomment-639908945).

senwu commented 4 years ago

@HiromuHota, you are exactly right!

My comments were mainly focused on PDF documents as input data. If the input data is HTML, it would be very hard to find the exact words in some cases.

And I agree with your suggested solution of embedding the coordinates into the HTML; it's actually the solution we want to have. That solution has multiple benefits: it won't rely on Adobe, it supports scanned PDFs nicely, and it gives the Fonduer parser more control.

HiromuHota commented 4 years ago

I've tested pdftotree for the first time and learned that pdftotree tests/data/pdf_simple/md.pdf gives me

<html><div id=1>
<header
    char='S a m p l e   M a r k d o w n ',
    top='37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 ',
    left='35.0 48.34765616 60.34765616 80.3398436 93.68749976000001 100.35546848 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.3398436 ',
    bottom='61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 ', 
    right='48.34765616 60.34765616 80.3398436 93.68749976 100.35546848000001 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.33984359999997 231.68749975999998 '
>Sample Markdown </header>

I think we can just take this approach, which embeds coordinates (top, left, bottom, right) in the corresponding HTML element, and let Fonduer parse it. I wonder if you guys have ever ingested this style of HTML into Fonduer... As for my own experience, I create a similar format of HTML ("HTML-like") and let Fonduer parse it with a custom VisualLinker of my own.

Let's make this approach a first-class citizen in Fonduer, since we have been doing this as a custom step anyway.
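
To show how per-character coordinates like the ones above could be turned into word-level boxes, here is a minimal sketch. It takes the text and coordinate strings as plain Python values (copied from the header element above) rather than parsing the pdftotree output itself, and the helper name char_coords_to_words is made up for this example; it is not an existing Fonduer or pdftotree function.

```python
TEXT = "Sample Markdown "
LEFT = (
    "35.0 48.34765616 60.34765616 80.3398436 93.68749976000001 "
    "100.35546848 111.00781232 117.00781232 139.66015615999999 "
    "151.66015615999999 162.3125 175.66015615999999 189.00781231999997 "
    "201.00781231999997 218.3398436 "
)
RIGHT = (
    "48.34765616 60.34765616 80.3398436 93.68749976 100.35546848000001 "
    "111.00781232 117.00781232 139.66015615999999 151.66015615999999 "
    "162.3125 175.66015615999999 189.00781231999997 201.00781231999997 "
    "218.33984359999997 231.68749975999998 "
)
TOP = "37.94140616000004 " * 15      # same value for all 15 characters
BOTTOM = "61.94140616000004 " * 15   # same value for all 15 characters


def char_coords_to_words(text, left, top, right, bottom):
    """Group per-character coordinates into (word, (left, top, right, bottom))."""
    columns = [[float(v) for v in s.split()] for s in (left, top, right, bottom)]
    words, current = [], []

    def flush():
        if current:
            word = "".join(c[0] for c in current)
            box = (
                min(c[1] for c in current),
                min(c[2] for c in current),
                max(c[3] for c in current),
                max(c[4] for c in current),
            )
            words.append((word, box))
            current.clear()

    # zip stops at the shortest input, so the trailing space in TEXT is ignored.
    for ch, l, t, r, b in zip(text, *columns):
        if ch.isspace():
            flush()  # whitespace ends the current word
        else:
            current.append((ch, l, t, r, b))
    flush()
    return words


words = char_coords_to_words(TEXT, LEFT, TOP, RIGHT, BOTTOM)
```

A parser that understood this style of HTML could attach boxes like these to each Sentence directly, without a separate linking pass against a PDF.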

HiromuHota commented 4 years ago

According to https://documents.icar-us.eu/documents/2016/12/report-on-file-formats-for-hand-written-text-recognition-htr-material.pdf, there are 4 major file formats for OCR:

PAGE XML
ALTO XML
ABBYY FineReader XML
hOCR

I'd propose to support hOCR because:

The first three formats are standalone XML. hOCR is also XML-based, but it is embedded in HTML/XHTML documents. Because of this characteristic, Fonduer can support hOCR by just extending the current parser.py.
Tesseract, the most popular open-source OCR engine, supports hOCR as an output format. This means that Fonduer will be able to parse Tesseract's output without any conversion.

I'll open a separate feature request that should eventually resolve this issue.
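
For a feel of what reading hOCR involves, here is a minimal sketch using only Python's standard-library html.parser. hOCR stores a word's box in the title attribute of an ocrx_word element as "bbox x0 y0 x1 y1". The sample markup is hand-written for illustration (it is not real Tesseract output), and the class and function names are made up; this is not how Fonduer's parser.py works today.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 83 155 114 168'>
  <span class='ocrx_word' title='bbox 83 155 92 168; x_wconf 96'>1.</span>
  <span class='ocrx_word' title='bbox 95 155 114 168; x_wconf 95'>One</span>
</span>
"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently open, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocrx_word" in classes:
            self._bbox = parse_bbox(attributes.get("title") or "")

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


parser = HocrWordParser()
parser.feed(SAMPLE_HOCR)
hocr_words = parser.words
```

Because the coordinates travel inside the HTML itself, no second document has to be aligned against it, which is the point of the proposal above.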

senwu commented 4 years ago

Awesome! I think your suggestion (hOCR) is super great!

HazyResearch / fonduer

Word mismatch between HTML and PDF for visual linker #12