atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Table placed in a single row #270

Closed aperna1 closed 5 years ago

aperna1 commented 5 years ago

I'm trying to return the contents of the table on each page (the table in this case is whatever is inside the rectangle in the bottom third of each page). The script seems to recognize the table and its contents properly, but it reproduces the table as a single row rather than breaking it up into multiple rows and columns. Is there a way to reproduce the table as it appears in the PDF, rather than in one row?

PDF: 02012018.pdf

My script:

import camelot

pdf = 'pdf/02012018.pdf'
tables = camelot.read_pdf(pdf, pages='1', process_background=True, split_text=True, flag_size=True)  # strip_text=' .\n'
print(tables[0].df)

Returns the following: [screenshot: screen shot 2019-02-06 at 12 29 37 am]

anakin87 commented 5 years ago

Try with:

tables = camelot.read_pdf("02012018.pdf", flavor='stream', table_areas=['150,425,450,375'], split_text=True, flag_size=True)

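The default flavor='lattice' relies on demarcated lines between cells to parse tables. In this case, I suggest using flavor='stream', which relies on the whitespace between cells instead. The table_areas parameter specifies the exact table boundaries as 'x1,y1,x2,y2' strings, where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space.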
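To make the table_areas format concrete, here is a minimal sketch of a helper that builds one such string. The name make_table_area is hypothetical and not part of Camelot. It assumes PDF coordinates with the origin at the bottom-left of the page, so the top edge has the larger y value.

```python
def make_table_area(left, top, right, bottom):
    """Format one bounding box as an 'x1,y1,x2,y2' string.

    Hypothetical helper, not a Camelot function. Assumes PDF coordinate
    space (origin at the bottom-left), so top must be greater than bottom.
    """
    if not (left < right and bottom < top):
        raise ValueError("expected left < right and bottom < top in PDF space")
    return ",".join(str(v) for v in (left, top, right, bottom))


# The area suggested in this thread: top-left (150, 425), bottom-right (450, 375).
area = make_table_area(150, 425, 450, 375)
print(area)  # 150,425,450,375
```

The resulting string would go into the list passed as table_areas, for example table_areas=[area].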

aperna1 commented 5 years ago

How would table_areas work given that the table boundaries are different for each table? I guess another way to ask this question: how can I programmatically find the coordinates x1, y1, x2, y2? I've had some luck using:

for row in tables[0].cells:
    for cell in row:
        print(cell.rb, cell.lt)

But then I'm getting coordinates for multiple rows.

aperna1 commented 5 years ago

I was able to get it with the following. Thanks for the help!

col = len(tables[0].cols) - 1
row = len(tables[0].rows) - 1
x1 = tables[0].cols[0][0]
x2 = tables[0].cols[col][1]
y1 = tables[0].rows[0][0]
y2 = tables[0].rows[row][1]
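For anyone reusing this, the same idea can be wrapped in a small function that returns a ready-made table_areas string. This is only a sketch: table_area_from_grid is a hypothetical name, and it assumes (as the snippet above suggests) that cols holds (x_left, x_right) pairs and rows holds (y_top, y_bottom) pairs. The sample grid is made up for illustration and does not come from the PDF in this issue.

```python
def table_area_from_grid(cols, rows):
    """Collapse a table grid into one 'x1,y1,x2,y2' bounding-box string.

    Hypothetical helper. Assumes cols is a list of (x_left, x_right) pairs
    and rows is a list of (y_top, y_bottom) pairs in PDF coordinates,
    where y grows upward.
    """
    x1 = min(c[0] for c in cols)  # leftmost edge
    x2 = max(c[1] for c in cols)  # rightmost edge
    y1 = max(r[0] for r in rows)  # top edge (largest y)
    y2 = min(r[1] for r in rows)  # bottom edge (smallest y)
    return "{},{},{},{}".format(x1, y1, x2, y2)


# Made-up grid: two columns and two rows.
cols = [(150.0, 250.0), (250.0, 450.0)]
rows = [(425.0, 400.0), (400.0, 375.0)]
area = table_area_from_grid(cols, rows)
print(area)  # 150.0,425.0,450.0,375.0
```

With a real result, the call would be table_area_from_grid(tables[0].cols, tables[0].rows), and the returned string could then be passed as table_areas=[area] together with flavor='stream', as suggested earlier in the thread. Using min and max instead of the first and last index makes the result independent of the order in which the rows and columns are stored.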

vinayak-mehta commented 5 years ago

@aperna1 Can you close this issue if your problem was solved?