lukehsiao opened this issue 6 years ago
Can https://spacy.io/api/goldparse#align be useful to align two lists of words?
Good catch! We will take a look at this function. Thanks!
@lukehsiao What exactly was the issue? I see that "in the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML," but I couldn't see how "this is causing a mismatch of words when doing the visual parse".
md.xml.txt is the output of pdftotext -bbox-layout md.pdf md.xml.txt.
According to this, the Sentence for "One" has the following bbox:
<line xMin="83.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">
<word xMin="83.000000" yMin="155.556641" xMax="92.000000" yMax="168.845703">1.</word>
<word xMin="95.000000" yMin="155.556641" xMax="114.992188" yMax="168.845703">One</word>
</line>
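For reference, here is a minimal sketch (not part of Fonduer) of how the per-word bboxes could be pulled out of that pdftotext -bbox-layout XML. The file name md.xml.txt follows the example above, and tags are matched by local name so a namespace, if present, is tolerated.
import xml.etree.ElementTree as ET

def word_bboxes(path):
    # Yield (text, xMin, yMin, xMax, yMax) for every <word> element,
    # matching tags by local name so a namespace, if any, is tolerated.
    root = ET.parse(path).getroot()
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "word":
            yield (
                elem.text,
                float(elem.get("xMin")),
                float(elem.get("yMin")),
                float(elem.get("xMax")),
                float(elem.get("yMax")),
            )

for text, x0, y0, x1, y1 in word_bboxes("md.xml.txt"):
    print(text, x0, y0, x1, y1)  # e.g. One 95.0 155.556641 114.992188 168.845703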
On the other hand, according to #433, the VisualLinker correctly assigned a bbox to each Sentence, including "One", "Two", and "Three".
For example,
# Check bbox for "One"
sentence = doc.sentences[4]
assert sentence.text == "One"
assert sentence.get_bbox() == Bbox(page=1, left=95, top=155, right=114, bottom=168)
I don't see any problem from the visual linker perspective.
Has the original issue been resolved over the years, or am I mistaken?
Hmmm, interesting. This issue is old enough that I can't recall the specifics of the issue, nor do I have an example that demonstrates it. I should've added more specifics when this issue was created. It is definitely possible this issue was resolved over time and just was not closed...
@SenWu, do you have any recollection? If not, I'd suggest we close this as stale and can reopen if we find an issue and have specific examples in the future.
Hi @HiromuHota ,
I think this issue is about word mismatch between HTML and PDF documents. In this case, some words appear in the PDF but not in the HTML, such as 1, 2, and 3, and the order of words is not consistent between the two documents. These two problems cause word-matching issues for the visual linker. The reason for these problems is that we use different software to generate the different types of documents (PDFs and HTMLs). The ideal case would be to use a character-level parser to extract coordinates and document structure instead of existing tools such as Adobe.
Sen
Thank you guys for your recollections. I think the 1st problem is easier to solve, but the 2nd one is much harder, or might not be fixable at all: the two documents can never be guaranteed to be consistent, because HTML does not convey exactly (at the pixel level) how it is rendered and displayed.
The visual linker is supposed to link words between HTML and PDF. Depending on your use case, the original source could be HTML or PDF as illustrated below.
If I could take what @SenWu suggested one step further, I'd suggest embedding the coordinates into the HTML (maybe we should call it XML by then). As a result, Parser takes coordinates if they are embedded in the HTML and attaches them to Sentence (and other data models like Cell and Table). With this change, users would be expected to embed coordinates in the HTML on their own (they can use the visual linker, but it's totally up to them).
A bonus is that Fonduer would become conceptually easier to understand because the Parser treats HTML as the single source of all modal information.
FYI, one of my use cases is to extract information from scanned documents. So the source is PDF (more specifically, a scanned image in PDF, or non-searchable PDF).
By applying OCR, I make a searchable PDF. Then I use pdftotext to generate HTML (XML actually, e.g., the md.xml.txt in https://github.com/HazyResearch/fonduer/issues/12#issuecomment-639908945).
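To make that pipeline concrete, below is a minimal sketch of the two steps; the OCR step is assumed to be ocrmypdf (the comment above does not name a specific OCR tool), and all file names are illustrative.
import subprocess

def scanned_pdf_to_xml(scanned_pdf, searchable_pdf, xml_out):
    # 1. OCR the scanned (non-searchable) PDF into a searchable PDF.
    subprocess.run(["ocrmypdf", scanned_pdf, searchable_pdf], check=True)
    # 2. Extract words plus layout/bbox information as XML.
    subprocess.run(["pdftotext", "-bbox-layout", searchable_pdf, xml_out], check=True)

scanned_pdf_to_xml("scan.pdf", "scan_ocr.pdf", "scan.xml")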
@HiromuHota, you are exactly right!
My comments mainly focus on PDF documents as input data. If the input data is HTML, it would be very hard to match words exactly in some cases.
And I agree with your suggested solution of embedding the coordinates into the HTML; it's actually the solution we want to have. That solution has multiple benefits: it won't rely on Adobe, it supports scanned PDFs nicely, and it gives the Fonduer parser more control.
I've tested pdftotree for the first time and learned that pdftotree tests/data/pdf_simple/md.pdf gives me
<html><div id=1>
<header
char='S a m p l e M a r k d o w n ',
top='37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 ',
left='35.0 48.34765616 60.34765616 80.3398436 93.68749976000001 100.35546848 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.3398436 ',
bottom='61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 ',
right='48.34765616 60.34765616 80.3398436 93.68749976 100.35546848000001 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.33984359999997 231.68749975999998 '
>Sample Markdown </header>
I think we can just take this approach: embed the coordinates (top, left, bottom, right) in the corresponding HTML element and let Fonduer parse it.
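As a rough illustration of what that parsing could look like (not Fonduer's actual Parser), the per-character coordinate lists in the attributes above can be reduced to one bbox per element. Element and attribute names follow the pdftotree output shown; the file name md.html is an assumption.
from lxml import etree

def element_bbox(elem):
    # Reduce the per-character coordinate lists to a single
    # (left, top, right, bottom) box covering the whole element.
    lefts = [float(v) for v in elem.get("left").split()]
    tops = [float(v) for v in elem.get("top").split()]
    rights = [float(v) for v in elem.get("right").split()]
    bottoms = [float(v) for v in elem.get("bottom").split()]
    return (min(lefts), min(tops), max(rights), max(bottoms))

# md.html is assumed to hold the pdftotree output shown above.
root = etree.parse("md.html", etree.HTMLParser()).getroot()
for elem in root.iter():
    if elem.get("char") is not None:  # only elements carrying coordinates
        print(elem.tag, (elem.text or "").strip(), element_bbox(elem))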
I wonder if you guys have ever ingested this style of HTML into Fonduer...
As for my own experience, I create a similar format of HTML ("HTML-like") and let Fonduer parse it with a custom VisualLinker of my own.
Let's make this approach a first-class citizen in Fonduer, as we have been doing it as a custom step.
According to https://documents.icar-us.eu/documents/2016/12/report-on-file-formats-for-hand-written-text-recognition-htr-material.pdf, there are 4 major file formats for OCR:
I'd propose to support hOCR because it fits naturally with the existing parser.py.
I'll make a separate feature request that eventually resolves this issue.
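For illustration, hOCR keeps word-level coordinates in the title attribute (e.g. title="bbox x0 y0 x1 y1"). Below is a minimal sketch of reading them, assuming Tesseract-style class names (ocrx_word); the file name md.hocr is illustrative.
import re
from lxml import etree

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

def hocr_word_bboxes(path):
    # Yield (word_text, (x0, y0, x1, y1)) for each word-level span.
    doc = etree.parse(path, etree.HTMLParser())
    for span in doc.xpath("//span[@class='ocrx_word']"):
        m = BBOX_RE.search(span.get("title", ""))
        if m:
            yield "".join(span.itertext()).strip(), tuple(map(int, m.groups()))

for word, bbox in hocr_word_bboxes("md.hocr"):
    print(word, bbox)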
Awesome! I think your suggestion (hOCR) is super great!
In the test md document we use an ordered HTML list, which renders as numbers in the PDF. This is causing a mismatch of words when doing the visual parse. The lists of words in that document are:
'", 'var bar | foo '", '= ; | 'bar'; . | . Or | Or an | an image | image of | of bears | bears The | The end | end ... | ...
Notice that in the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML. Ideally this will be resolved when we switch to pdftotree and the new parser.