Layout-Parser / layout-parser

A Unified Toolkit for Deep Learning Based Document Image Analysis
https://layout-parser.github.io/
Apache License 2.0

How to Improve Table Extraction #87

Open ehildebrandtrojo opened 2 years ago

ehildebrandtrojo commented 2 years ago

I am working with a large set of historical tables and need to extract their rows and columns. I ran various layout models from the Model Zoo, but the only ones that give me interesting results are the HJDataset models. Even so, the model does not consistently identify the different features in the table (image attached). For instance, it detects some rows and columns but not all of those present in the image (e.g. columns of numbers are detected well in the first table but not in subsequent tables).

Any advice/suggestions on how to best proceed?

[Image: LayoutParserDetection]

lolipopshock commented 2 years ago

Thanks for bringing up this issue, @ehildebrandtrojo! Detecting complex archival tables is always a challenge, and the solution can vary case by case. Generally, I would suggest training a dedicated model using our toolkit for annotation and training (you can see more similar tables in our Slack channel if you are interested). We have also recently been working on some tools that could make this process even easier. Please stay tuned for updates!

kforcodeai commented 2 years ago

@ehildebrandtrojo Can you please paste the table image without the detection overlay?

ehildebrandtrojo commented 2 years ago

Yeah for sure! @k-for-code

[Image: Tarnopol - Table 29_Page_24]

anhhaibkhn commented 2 years ago

Great tools! However, I have the same question as @ehildebrandtrojo about how to improve table detection and extraction results. I ran numerous trials with invoice images and could not find any currently available model that gives satisfactory results. I think one of the challenges is that documents contain different types of tables (fully bordered, borderless, without separating lines between rows or columns, etc.). Should we first classify the documents by table type and then apply the appropriate model or detection technique? I would also like to hear everyone's suggestions.
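
To make the "classify first" idea concrete, here is a rough sketch of how a page could be sorted by table type before picking a model. It is only a heuristic I am assuming for illustration, not something layout-parser provides: it takes an already binarized image (a list of rows, 1 = dark pixel) and counts long unbroken dark runs as ruling lines. The function names, the labels, and the 0.8 threshold are all hypothetical and would need tuning on real scans, which are often skewed or have broken lines.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized page: 1 = dark pixel, 0 = background


def longest_run(values: List[int]) -> int:
    """Length of the longest consecutive run of dark pixels."""
    best = current = 0
    for value in values:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


def count_groups(flags: List[bool]) -> int:
    """Count runs of adjacent True flags, so a thick line counts once."""
    groups, previous = 0, False
    for flag in flags:
        if flag and not previous:
            groups += 1
        previous = flag
    return groups


def count_ruling_lines(binary: Image, min_fraction: float = 0.8) -> Tuple[int, int]:
    """Return (horizontal, vertical) counts of candidate ruling lines.

    A pixel row or column is a candidate when it contains a dark run
    covering at least min_fraction of the image width or height
    (an assumed threshold, not a tuned value).
    """
    height, width = len(binary), len(binary[0])
    row_flags = [longest_run(row) >= min_fraction * width for row in binary]
    col_flags = [
        longest_run([row[x] for row in binary]) >= min_fraction * height
        for x in range(width)
    ]
    return count_groups(row_flags), count_groups(col_flags)


def classify_table(binary: Image, min_fraction: float = 0.8) -> str:
    """Assign a coarse, hypothetical table-type label from the line counts."""
    horizontal, vertical = count_ruling_lines(binary, min_fraction)
    if horizontal >= 2 and vertical >= 2:
        return "bordered"
    if horizontal >= 2 or vertical >= 2:
        return "partially_bordered"
    return "borderless"
```

The returned label could then decide which model or detection technique to run on that page, for example a line-based approach for "bordered" tables and a layout model for "borderless" ones.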

@lolipopshock Are there any plans to make some of the SoTA models from the leaderboard available in the future?