HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

Extracting Information from tables without Borders #502

Closed saikalyan9981 closed 4 years ago

saikalyan9981 commented 4 years ago

During extraction, Fonduer does not detect the structure of tables without borders. For example, in the attached figure (table_without_border), I would like to extract the mentions under all the columns. Without Fonduer detecting the table structure, this is difficult, especially for the mentions under the column "Artikel (GTIN/EAN)", as they are quite long. Can you please help me solve this problem?

As of now, I'm doing this: I extract the mentions under Artikel Nr. and Verpackungseinheit (VE) using regex, and a labeling function checks that they are horizontally aligned. Then, based on the bbox values of the extracted mentions, I get the text between them using PDFMiner.
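A minimal sketch of the bbox-based step described above, not the poster's actual code. It assumes each word is available as a box with numeric coordinates where `top` is numerically smaller than `bottom`; the `Word` type, the helper names, and the sample values are all hypothetical. Two boxes are treated as being on the same row when they overlap vertically by at least half the height of the shorter box, and the text between two anchor mentions is whatever lies on that row between them.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical word box; coordinates are assumed to satisfy top < bottom."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def same_row(a: Word, b: Word, min_overlap: float = 0.5) -> bool:
    """True if the boxes overlap vertically by at least min_overlap of the shorter box."""
    overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    return shorter > 0 and overlap / shorter >= min_overlap


def text_between(left: Word, right: Word, words: List[Word]) -> str:
    """Join words on the same row as `left` that lie horizontally between the anchors."""
    middle = [
        w for w in words
        if same_row(left, w) and w.x0 >= left.x1 and w.x1 <= right.x0
    ]
    return " ".join(w.text for w in sorted(middle, key=lambda w: w.x0))


# Made-up sample row: article number, description, packaging unit.
words = [
    Word("4711", 10, 100, 40, 110),
    Word("Example", 60, 100, 100, 110),
    Word("item", 104, 101, 125, 111),
    Word("12", 300, 100, 312, 110),
    Word("4712", 10, 120, 40, 130),  # next row
]

print(text_between(words[0], words[3], words))
```

This only handles cells that fit on one line. A long cell that wraps onto several lines would need the row test relaxed, for example by collecting every box between the anchors down to the top of the next anchor row.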

Can you please suggest a better method, or point me to some tools or papers? My process often fails to extract the correct "Artikel (GTIN/EAN)".

Image reference: shown for illustrative purposes only. My actual document doesn't even contain horizontal lines or borders.

HiromuHota commented 4 years ago

Fonduer does not detect tables in an image. You need to provide tabular information. If you can only provide textual and visual (bbox) information, what you are currently doing would be the best thing you could do.

How to detect tables? It seems that ABBYY FineReader can detect tables (https://abbyy.technology/en:features:ocr:xml). There are probably some open-source alternatives too.

Hope this helps.

saikalyan9981 commented 4 years ago

Thanks @HiromuHota, I will check that out. Also, I wasn't extracting from images. I was trying to say that Fonduer isn't detecting borderless tables while parsing a document.

HiromuHota commented 4 years ago

From the image that you uploaded, I thought you were dealing with scanned documents. If not, what is your original source: HTML or PDF? If PDF, how do you convert the PDF to HTML?

Basically, Fonduer does not care about how tables look to human eyes. What really matters is how the table is represented in HTML. To be treated as a table, whatever looks like a table to the human eye must be represented as a <table> with correct <tr> and <td> elements.
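One quick way to check this on a converted file is to count the table-related tags in the HTML. The sketch below is not part of Fonduer; it uses only Python's standard `html.parser`, and the class name, function name, and the two sample strings are made up for illustration. If the converter emitted absolutely positioned text instead of real table markup, the counts come back empty.

```python
from collections import Counter
from html.parser import HTMLParser


class TableTagCounter(HTMLParser):
    """Counts opening <table>, <tr>, <td>, and <th> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in ("table", "tr", "td", "th"):
            self.counts[tag] += 1


def count_table_tags(html: str) -> Counter:
    parser = TableTagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up samples: real table markup vs. positioned text that only looks tabular.
as_table = "<table><tr><td>Artikel Nr.</td><td>VE</td></tr></table>"
as_divs = (
    '<div style="position:absolute; left:10px">Artikel Nr.</div>'
    '<div style="position:absolute; left:200px">VE</div>'
)

print(dict(count_table_tags(as_table)))
print(dict(count_table_tags(as_divs)))
```

Zero `table` tags for a page that visibly contains a table suggests the structure was lost during PDF-to-HTML conversion rather than during parsing.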

If you can show us the HTML file (you can mask sensitive parts), that would be helpful.

saikalyan9981 commented 4 years ago

Yes, this is an issue with the HTML conversion of tables in the PDF, not with Fonduer.