HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License
407 stars, 77 forks

BBox value errors #507

Closed saikalyan9981 closed 3 years ago

saikalyan9981 commented 4 years ago

When I retrieve bbox values (for example, the left values) for some sentences, the left value for the "\n" word in the sentence is an outlier. This makes the bbox of the entire sentence wrong, because the sentence's left is calculated as the min of all lefts in the sentence.

For example, if the sentence is [Hello World] with words [Hello, "\n", World], the left values are like [213, 154, 213]. This is observed when the Hello\nWorld sentence is in the middle of a PDF page like this:

page elements:
| abc   Hello   efg |
| def   world   hij |
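A minimal sketch in plain Python of how a single outlier pulls the sentence's left edge, using the numbers above. This is not Fonduer's API: the function name `sentence_left` and the `skip_whitespace` filter are hypothetical and only illustrate the min-of-lefts calculation.

```python
def sentence_left(words, lefts, skip_whitespace=False):
    """Return the sentence's left edge as the min of its words' left values.

    Hypothetical helper for illustration only; not part of Fonduer.
    """
    pairs = list(zip(words, lefts))
    if skip_whitespace:
        # Assumed workaround: ignore whitespace-only words such as "\n".
        pairs = [(w, l) for w, l in pairs if w.strip()]
    return min(l for _, l in pairs)


words = ["Hello", "\n", "World"]
lefts = [213, 154, 213]

print(sentence_left(words, lefts))                        # 154 (outlier wins)
print(sentence_left(words, lefts, skip_whitespace=True))  # 213
```

Skipping whitespace-only words only hides the symptom in this example; it says nothing about why the "\n" word received the wrong left value in the first place.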

Please let me know if I need to add any other details.

HiromuHota commented 4 years ago

Duplicate of #12

saikalyan9981 commented 4 years ago

Thanks @HiromuHota, I think that's the root cause of this issue. Any idea whether hOCR will be included in the next version?

HiromuHota commented 4 years ago

The current visual linker links words between the HTML and the PDF just by looking at the similarity of each word. As a result, common words, which appear many times in a document, tend to be mis-linked.
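A deliberately simplified sketch of this failure mode, assuming a linker that matches purely on word text and takes the first PDF word with the same text. It is not Fonduer's actual linker; `link_by_text` and the sample coordinates are made up for illustration.

```python
def link_by_text(html_words, pdf_words):
    """Link each HTML word to the first PDF word with identical text.

    pdf_words is a list of (text, left) tuples; returns one left value
    per HTML word. Hypothetical simplification, not Fonduer's code.
    """
    first_seen = {}
    for text, left in pdf_words:
        # Only the first occurrence of each text is remembered.
        first_seen.setdefault(text, left)
    return [first_seen.get(word) for word in html_words]


pdf_words = [("the", 50), ("cat", 90), ("saw", 130), ("the", 170), ("dog", 210)]
html_words = ["the", "cat", "saw", "the", "dog"]

print(link_by_text(html_words, pdf_words))  # [50, 90, 130, 50, 210]
```

The second "the" is linked to left 50 instead of 170, because text alone cannot tell the two occurrences apart. Unique words such as "cat" and "dog" are linked correctly.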

I'm currently working on hOCR. As this requires an architectural change, it would need a minor version bump (e.g., v0.9.X). Please stay tuned.