Not able to read pdf tables spread across multiple pages

atlanhq / camelot

Camelot: PDF Table Extraction for Humans

https://camelot-py.readthedocs.io

Not able to read pdf tables spread across multiple pages #278

Closed PraveenDevop closed 5 years ago

PraveenDevop commented 5 years ago

My table is spread across multiple pages, and when the horizontal line is missing I am not able to read the table. Resume1.pdf PDF link: https://onedrive.live.com/?id=690704CAD1449D85%21105&cid=690704CAD1449D85

In the attached PDF I am not able to read the 2x2 table. Can we identify tables using only vertical lines, or detect tables by combining boundaries from both pages?

I am using Python 2.7 and camelot-py==0.7.2.

PraveenDevop commented 5 years ago

Code used:

    import pandas as pd
    import camelot

    FileName="Filepath"
    DF3=camelot.read_pdf(FileName,multiple_tables=True,options="--pages 'all'", lattice= True)
    print DF3

anakin87 commented 5 years ago

You can try to extract tables with flavor='stream', but the output is not so good.

I think Camelot currently doesn't extract a table that spreads across multiple pages as a single table, but as separate tables. It would be a nice-to-have feature.
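Since each page's fragment comes back as its own table, one workaround is to stitch the fragments together after extraction. The following is only a sketch: `stitch_rows` and `read_spanning_table` are hypothetical helper names, and it assumes every table Camelot returns for the file is a fragment of the same logical table, in page order, with the same columns.

```python
from typing import List, Sequence

Row = List[str]


def stitch_rows(fragments: Sequence[Sequence[Row]],
                repeated_header: bool = False) -> List[Row]:
    """Concatenate per-page table fragments into one list of rows.

    If repeated_header is True, a first row on a later fragment that is
    identical to the very first row of the merged table is dropped.
    """
    merged: List[Row] = []
    for index, rows in enumerate(fragments):
        rows = [list(row) for row in rows]
        if repeated_header and index > 0 and merged and rows and rows[0] == merged[0]:
            rows = rows[1:]
        merged.extend(rows)
    return merged


def read_spanning_table(path: str, flavor: str = "stream") -> List[Row]:
    """Hypothetical wrapper: extract all tables, then stitch them together."""
    import camelot  # imported lazily so stitch_rows is usable without Camelot

    tables = camelot.read_pdf(path, pages="1-end", flavor=flavor)
    # Assumption: sorting by page, then by order on the page, gives reading order.
    ordered = sorted(tables, key=lambda t: (t.page, t.order))
    return stitch_rows([t.df.values.tolist() for t in ordered],
                       repeated_header=True)
```

If the PDF contains unrelated tables as well, the fragments would need to be selected by hand before stitching.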

vinayak-mehta commented 5 years ago

@anakin87 is correct, tables spanning across multiple pages aren't extracted into a single table currently.

@PraveenDevop, where did you find the following code usage?

DF3=camelot.read_pdf(FileName,multiple_tables=True,options="--pages 'all'", lattice= True)

multiple_tables, options and lattice aren't valid keyword arguments supported by the library.

PraveenDevop commented 5 years ago

These are Tabula arguments.

vinayak-mehta commented 5 years ago

Tabula keyword arguments won't work inside Camelot. You can check out the advanced guide to see what keyword arguments Camelot supports.
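For comparison, here is a rough sketch of how the Tabula-style options in the call above could be expressed with Camelot's own keywords. `camelot_kwargs` and `read_tables` are hypothetical helpers written for illustration; the only Camelot keywords used are `pages` and `flavor`. There is no equivalent of `multiple_tables` here because `camelot.read_pdf` already returns a list of tables.

```python
def camelot_kwargs(pages: str = "all", lattice: bool = True) -> dict:
    """Translate two Tabula-style options into Camelot read_pdf keywords.

    pages   -> passed through as Camelot's `pages` string ('all', '1,3', '1-end')
    lattice -> selects Camelot's `flavor` ('lattice' or 'stream')
    """
    return {"pages": pages, "flavor": "lattice" if lattice else "stream"}


def read_tables(path: str, pages: str = "all", lattice: bool = True):
    """Hypothetical wrapper around camelot.read_pdf using the mapping above."""
    import camelot  # imported lazily; requires camelot-py to be installed

    return camelot.read_pdf(path, **camelot_kwargs(pages, lattice))
```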

PraveenDevop commented 5 years ago

Using the code below, I am still not able to detect the tables:

    import pandas as pd
    import camelot

    FileName="Filepath"
    tables = camelot.read_pdf(FileName,pages='1-end')
    print tables

anakin87 commented 5 years ago

I suggest using flavor='stream':

tables = camelot.read_pdf(FileName,pages='1-end',flavor='stream')
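A possible way to automate this is to try the default lattice flavor first and fall back to stream only when nothing is detected. This is a sketch, not something Camelot does on its own: `first_flavor_with_tables` and `count_tables` are hypothetical names, and "at least one table found" is an assumed success criterion that says nothing about the quality of the output.

```python
from typing import Callable, Optional, Sequence


def first_flavor_with_tables(count: Callable[[str], int],
                             flavors: Sequence[str] = ("lattice", "stream")) -> Optional[str]:
    """Return the first flavor for which count(flavor) reports any tables."""
    for flavor in flavors:
        if count(flavor) > 0:
            return flavor
    return None


def count_tables(path: str, flavor: str) -> int:
    """Number of tables Camelot detects in the whole file with one flavor."""
    import camelot  # imported lazily; requires camelot-py to be installed

    return len(camelot.read_pdf(path, pages="1-end", flavor=flavor))


# Usage (FileName as in the thread):
# flavor = first_flavor_with_tables(lambda f: count_tables(FileName, f))
```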