atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Stream: Treating whole page as a single table. Ignores tables after the first. #307

Closed csrinivascreator closed 5 years ago

csrinivascreator commented 5 years ago

When using stream, tables after the first table aren't autodetected. Stream currently treats the whole page as a single table, so only the first table is extracted when a page has multiple tables. It would be good to have an attribute that tells Stream a page contains multiple tables, so that it loops over them and extracts each one.

vinayak-mehta commented 5 years ago

@csrinivascreator Can you please post an example that can help us to understand the issue better?

csrinivascreator commented 5 years ago

tables_stream = camelot.read_pdf(pdf, flavor='stream', pages='all', table_areas=['0,842,595,0'])

I used this approach, and it worked as an alternative: it makes stream read the complete page and return all the tables as output. The output looks good for an A4-size PDF (595 x 842 points), so it can be used as a workaround.
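For page sizes other than A4, the same string can be built from the page dimensions. This is a minimal sketch, assuming the documented `table_areas` format `x1,y1,x2,y2` (top-left corner, then bottom-right corner, in PDF points with the origin at the bottom-left). The helper names `full_page_table_area` and `read_whole_pages` are made up for illustration and are not part of Camelot.

```python
def full_page_table_area(width_pt, height_pt):
    """Build a table_areas string that spans the whole page.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    measured in PDF points from the bottom-left of the page.
    """
    return f"0,{height_pt:g},{width_pt:g},0"


def read_whole_pages(pdf_path, width_pt, height_pt):
    """Hypothetical wrapper around the workaround from this thread."""
    import camelot  # imported lazily so the helper above works without it

    return camelot.read_pdf(
        pdf_path,
        flavor="stream",
        pages="all",
        table_areas=[full_page_table_area(width_pt, height_pt)],
    )


A4 = (595, 842)      # width, height in points
LETTER = (612, 792)  # US Letter, for comparison

print(full_page_table_area(*A4))
print(full_page_table_area(*LETTER))
```

For A4 this reproduces the `'0,842,595,0'` string used above. Whether the full-page area helps depends on the layout of the PDF.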

Thank you for your quick response. You can close this issue since a workaround is available. I have also requested that #308 follow the same approach.

vinayak-mehta commented 5 years ago

Cool

ziaulrehman40 commented 4 years ago

@csrinivascreator The param you used might have solved the issue for the structure you had. @vinayak-mehta I think we should re-open this issue. I can provide an example PDF page that I have extracted.

Here are the details:

Here is the example file

There are 4 tables on the page, and I am getting only 1 (all 4 merged into 1) with the following:

camelot.read_pdf('stream multi_table_issue.pdf', flavor = 'stream')

I ran the plot utility, and here are the outputs: Figure_1 and Figure_2

I tried the parameter suggested by @christinegarcia, but it didn't seem to work (unless I am doing something wrong).
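For readers hitting the same merged-table result: `table_areas` accepts a list, so one thing that may be worth trying is passing one area per table instead of a single full-page area. This is only a sketch and is not confirmed as a fix anywhere in this thread. The helper `bands_to_table_areas` and the band coordinates are hypothetical; real values would have to be read from the PDF, for example from the plot output.

```python
def bands_to_table_areas(page_width_pt, bands):
    """Turn vertical bands into Camelot-style table_areas strings.

    bands is a list of (top_y, bottom_y) pairs in PDF points, measured
    from the bottom of the page, so top_y must be larger than bottom_y.
    Each band spans the full page width.
    """
    areas = []
    for top, bottom in bands:
        if top <= bottom:
            raise ValueError(f"top ({top}) must be above bottom ({bottom})")
        # x1,y1 is the top-left corner; x2,y2 is the bottom-right corner
        areas.append(f"0,{top:g},{page_width_pt:g},{bottom:g}")
    return areas


# Hypothetical extents for four tables stacked on an A4 page (595 pt wide)
example_bands = [(800, 620), (600, 430), (410, 240), (220, 50)]
areas = bands_to_table_areas(595, example_bands)
print(areas)

# Possible usage, assuming camelot is installed:
# tables = camelot.read_pdf('stream multi_table_issue.pdf',
#                           flavor='stream', table_areas=areas)
```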

ziaulrehman40 commented 4 years ago

Please at least re-open the issue so that other contributors or the general public can see this and possibly suggest some solutions. Otherwise I will have to open a new issue, I guess.

Also, for anyone who lands here: I tried my sample PDF above with tabula (through its web interface), and it seems to detect separate tables instead of one long table.

But we are going to have many table formats, so we are still trying to find the best fit.

I have a number of other parsing issues with both camelot and tabula right now. I will try to open separate issues for them with sample PDFs so they can potentially be fixed in the future.

NOTE: I found some paid services that seem to work a lot more accurately. Not to advertise, but this service seems to work great for most of our cases, at least in the demo/free pass.