Closed saikalyan9981 closed 4 years ago
Fonduer does not detect tables in an image. You need to provide tabular information. If you can only provide textual and visual (bbox) information, what you are currently doing would be the best thing you could do.
How to detect tables? It seems that ABBYY FineReader can detect tables (https://abbyy.technology/en:features:ocr:xml). There would be some open-source alternatives too.
Hope this helps.
Thanks @HiromuHota, I will check that out. Also, I wasn't extracting from images; I was trying to say that Fonduer isn't detecting borderless tables while parsing a document.
From the image that you uploaded, I thought you were dealing with scanned documents. If not, what is your original source? HTML or PDF? If PDF, how do you convert PDF to HTML?
Basically, Fonduer does not care about how tables look to human eyes.
What really matters is how the table is represented in HTML.
To be treated as a table, the table (or at least what looks like a table to the human eye) should be represented in <table> with correct <tr> and <td>.
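One quick way to see which situation you are in is to count the table tags in the converted HTML. This is only a minimal sketch using Python's standard-library html.parser; the helper name count_table_tags and the two sample strings are made up for illustration and are not part of Fonduer or of any converter's real output.

```python
from html.parser import HTMLParser


class TableTagCounter(HTMLParser):
    """Counts <table>, <tr>, and <td> start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = {"table": 0, "tr": 0, "td": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.counts:
            self.counts[tag] += 1


def count_table_tags(html):
    """Hypothetical helper: return the number of table-related start tags."""
    parser = TableTagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up converter outputs for the same visually tabular row.
as_table = "<table><tr><td>4711</td><td>Some article</td><td>12</td></tr></table>"
as_divs = "<div><span>4711</span><span>Some article</span><span>12</span></div>"

print(count_table_tags(as_table))
print(count_table_tags(as_divs))
```

If the counts for your converted file are all zero, the converter emitted the rows as something other than table markup, and the tabular structure was lost before Fonduer ever saw the document.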
If you can show us the HTML file (you can mask sensitive parts), that would be helpful.
Yes, this is an issue with the HTML conversion of tables in the PDF, not with Fonduer.
While extracting, Fonduer is not detecting the structure of tables without borders. For example, in the given figure, I would like to extract the mentions under all the columns. Without Fonduer detecting the table structure, this is difficult, especially for the mentions under the column "Artikel (GTIN/EAN)", as they are quite long. Can you please help me solve this problem?
As of now, I'm doing this: I extract the mentions under Artikel Nr. and Verpackungseinheit(VE) using regex, and I have a labeling function that pairs them up when they are horizontally aligned. Then, based on the bbox values of the extracted mentions, I get the text between them using PDFMiner.
Can you please suggest a better method, or point me to some tools or papers? My process often fails to extract the correct "Artikel (GTIN/EAN)".
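To make the bbox step concrete, here is a rough sketch of the idea in plain Python. It assumes the words and their bounding boxes have already been pulled out of the PDF; the Word tuple, same_row, text_between, and the sample coordinates are all hypothetical names and values for illustration, not PDFMiner or Fonduer APIs. It also assumes each table row sits on a single text line, which may not hold for long "Artikel (GTIN/EAN)" entries.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """A word with its bounding box (hypothetical structure, top-left origin)."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def same_row(a: Word, b: Word, min_overlap: float = 0.5) -> bool:
    """True if the vertical extents of a and b overlap enough to share a row."""
    overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    return shorter > 0 and overlap / shorter >= min_overlap


def text_between(left: Word, right: Word, words: List[Word]) -> str:
    """Join the words lying on left's row, horizontally between left and right."""
    middle = [
        w for w in words
        if same_row(left, w) and w.x0 >= left.x1 and w.x1 <= right.x0
    ]
    return " ".join(w.text for w in sorted(middle, key=lambda w: w.x0))


# Made-up example: one row with an article number, a description, and a unit,
# plus a word from the next row that must not leak into the result.
left = Word("4711", 10, 100, 40, 110)
right = Word("12", 300, 100, 315, 110)
words = [
    left,
    Word("(4001234567890)", 95, 101, 200, 111),
    Word("Widget", 50, 100, 90, 110),
    right,
    Word("Other", 50, 120, 90, 130),
]

print(text_between(left, right, words))  # Widget (4001234567890)
```

The min_overlap threshold is a guess; when a cell wraps onto several lines, the row test would have to be widened to the whole vertical band between one Artikel Nr. and the next.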
Image reference: shown only for illustrative purposes. My actual document doesn't even contain horizontal lines/borders.