atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Same table is extracted twice from a pdf #406

Open maheshmarch213 opened 4 years ago

maheshmarch213 commented 4 years ago

I am trying to extract tables from a multiple page PDF file using camelot-py v0.7.3.

So far it has been the best PDF reading tool for me. I just need to read the PDF line by line and detect the tables manually. I tried many other tools, such as tabula, PyPDF2/4, and pdfminer. Some of them could not detect the text properly, and others disturbed the word order or the spacing between columns.

But camelot-py gave me the data in the format which is best suited for my application.

When extracting data from the PDF, camelot-py detects almost all of the table data well, except for a few errors:

  1. It groups multiple tables together in the same 'TableList' element. I am able to separate these grouped tables, so this is not a concern.
  2. The last table of each group is repeated in a separate 'TableList' element. This repetition is my main concern. The repeated table comes before the grouped tables.

The code used for the above process is:

import camelot

# Read every page of the PDF using the stream flavor
tables = camelot.read_pdf('test.pdf', pages='1-end', flavor='stream')
tables.export('foo.csv', f='csv', compress=False)

for table in tables:
    table_df = table.df
    # Code to parse the data from each table, converted into a DataFrame

Input PDF file: I can't share the PDF files because of sensitive data, but here are some details that should give a good idea of the structure. All pages contain only tables.

  Page 1: Table 1, which contains the customer's info, and Tables 2 to 4, which share the same structure
  Page 2: Some rows from Table 4, and Tables 5 to 7 with the same structure as Table 2
  Page 3: Tables 8 to 10 with the same structure as Table 2

Output CSV files:

  foo-page-1-table-1: Contains Table 1
  foo-page-1-table-2: Contains the last row (repeated) from Table 1 and Tables 2 to 4
  foo-page-2-table-1: Contains Table 7 (repeated, with the first row missing)
  foo-page-2-table-2: Contains some end rows from Table 4 and Tables 5 to 7
  foo-page-3-table-1: Contains Table 10 (repeated fully)
  foo-page-3-table-2: Contains Tables 8 to 10

Why is camelot-py repeating some tables? Is there any way to handle this repetition?
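
One possible workaround, sketched below, is to filter the repeats out after extraction. This is not a camelot feature and does not explain why the repetition happens. It assumes each table has already been turned into a list of rows of strings, and it treats a table as repeated when all of its non-empty rows also appear in a larger table. The helper names row_key and drop_repeated_tables are hypothetical. Empty cells are ignored when comparing rows, on the assumption that the repeated copy may be split into a different number of columns than the grouped one.

```python
def row_key(row):
    """Normalize a row: strip whitespace and drop empty cells."""
    return tuple(cell.strip() for cell in row if cell.strip())


def drop_repeated_tables(tables):
    """Return the indices of the tables worth keeping.

    `tables` is a list of tables, each given as a list of rows
    (lists of strings). A table is dropped when every one of its
    non-empty rows also occurs in another, larger table.
    """
    keyed = []
    for rows in tables:
        keys = {row_key(row) for row in rows}
        keys.discard(())  # ignore rows that are entirely empty
        keyed.append(keys)

    keep = []
    for i, keys in enumerate(keyed):
        # `keys < other` is a strict-subset test, so two identical
        # tables are both kept rather than both dropped.
        repeated = any(
            i != j and keys and keys < other
            for j, other in enumerate(keyed)
        )
        if not repeated:
            keep.append(i)
    return keep


# Made-up data: the first table repeats the last row of the second.
repeat = [["C ", "", "3"]]
grouped = [["A", "1"], ["B", "2"], ["C", "3"]]
print(drop_repeated_tables([repeat, grouped]))  # [1]
```

With camelot, the input could be built from the extracted tables with something like [t.df.values.tolist() for t in tables], ideally one page at a time so that tables from different pages are never compared. Rows that legitimately occur in two different tables would also match, so the result should be checked against the PDF.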