atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Poor table auto-detection? #304

Closed homofortis closed 5 years ago

homofortis commented 5 years ago

First of all, Camelot is a great project with enormous potential! In my experience, Camelot extracts most tables from a document properly, provided the right parameters have been supplied. However, tuning parameters per document is not always possible in the real world. Sometimes you have to deal with thousands of documents with different layouts, and processing them one by one is not an option. It seems that auto-detection of tables in documents doesn't work very well at the moment. I tried to run a bulk table extraction on PDF documents with random layouts, and the results were very poor. There is probably a need for a new, robust bulk extraction method that works for both framed and streamed tables and produces acceptable results. In other words, sometimes it may be worth trading accuracy for generalisation.

vinayak-mehta commented 5 years ago

@homofortis Did you randomly apply the flavors to that large set of PDFs? How did you go about measuring the accuracy of results? Automatically choosing the flavor based on tables in a PDF is a planned enhancement #211.

The second issue is improving table auto-detection in the flavors themselves. It would help us in development if you could post that set of documents along with your findings.
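
Until something like #211 exists, a rough way to approximate flavor selection in a bulk run is to try both flavors on each PDF and keep the one whose tables report the higher mean `accuracy` in Camelot's `parsing_report`. The sketch below is only an illustration under that assumption: the helper names (`score`, `pick_flavor`, `extract_best`) are hypothetical and not part of Camelot, and the reported accuracy is Camelot's own estimate rather than a ground-truth measurement, so it should be treated as a heuristic.

```python
from statistics import mean

# Camelot's two parsing flavors, tried in this order (ties go to the first).
FLAVORS = ("lattice", "stream")


def score(reports):
    """Mean self-reported accuracy over parsing-report dicts; 0.0 if none."""
    if not reports:
        return 0.0
    return mean(r["accuracy"] for r in reports)


def pick_flavor(reports_by_flavor):
    """Return the flavor whose tables have the highest mean accuracy."""
    return max(FLAVORS, key=lambda f: score(reports_by_flavor.get(f, [])))


def extract_best(path, pages="1"):
    """Run both flavors on one PDF and return (flavor, tables) for the winner."""
    import camelot  # imported lazily so the helpers above work without it

    tables_by_flavor = {}
    for flavor in FLAVORS:
        try:
            tables_by_flavor[flavor] = camelot.read_pdf(
                path, pages=pages, flavor=flavor
            )
        except Exception:
            # A failure in one flavor should not abort a bulk run.
            tables_by_flavor[flavor] = []

    reports = {
        flavor: [table.parsing_report for table in tables]
        for flavor, tables in tables_by_flavor.items()
    }
    best = pick_flavor(reports)
    return best, tables_by_flavor[best]
```

Calling `extract_best("some.pdf")` for every file in a directory and logging the chosen flavor and its score would also give a simple, repeatable way to report findings for a set of documents.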