atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Same page exported multiple times #449

Open dml5 opened 3 years ago

dml5 commented 3 years ago

I am extracting the tables from a large PDF and exporting them to CSV, but I am getting multiple CSV files for the same page. If the PDF is 1000 pages long, I expect 1000 CSV files, one for each page. The original PDF has no duplicate tables and never has more than one table on a page. How can I stop this? I could delete the extra files, but I don't want to delete them if they contain required data, and I can't check 1000 pages manually every time I run it.

For example, page 46 produces four files:

from_pdf-page-46-table-1.csv
from_pdf-page-46-table-2.csv
from_pdf-page-46-table-3.csv
from_pdf-page-46-table-4.csv
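One way to avoid opening every file by hand is to group the exported file names by the page number they embed and list the pages that produced more than one CSV. The following is a minimal sketch, assuming the `<stem>-page-<page>-table-<n>.csv` naming shown in the listing above. The helper names `tables_per_page` and `pages_with_extra_tables` are hypothetical, and the page 45 entry in the sample list is invented for illustration.

```python
import re
from collections import defaultdict

# Assumed naming scheme, taken from the exported files listed above:
# <stem>-page-<page>-table-<n>.csv
NAME_RE = re.compile(r"-page-(\d+)-table-(\d+)\.csv$")


def tables_per_page(filenames):
    """Map each page number to the sorted table numbers found for it."""
    pages = defaultdict(list)
    for name in filenames:
        match = NAME_RE.search(name)
        if match:  # skip names that do not follow the export pattern
            page, table = map(int, match.groups())
            pages[page].append(table)
    return {page: sorted(tables) for page, tables in pages.items()}


def pages_with_extra_tables(filenames):
    """Return, in ascending order, the pages that produced more than one CSV."""
    counts = tables_per_page(filenames)
    return sorted(page for page, tables in counts.items() if len(tables) > 1)


names = [
    "from_pdf-page-45-table-1.csv",  # hypothetical single-table page
    "from_pdf-page-46-table-1.csv",
    "from_pdf-page-46-table-2.csv",
    "from_pdf-page-46-table-3.csv",
    "from_pdf-page-46-table-4.csv",
]
print(pages_with_extra_tables(names))  # [46]
```

This only reports which pages were split. It does not say whether the extra files hold real rows or stray text, so the flagged pages would still need a look.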

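import glob

import camelot
import pandas as pd

# Extract tables from every page with the stream flavor.
tables = camelot.read_pdf('C:\\temp\\to_csv.pdf', pages='1-1000', row_tol=4, flavor='stream')
# Writes one file per detected table: from_pdf-page-<page>-table-<n>.csv
tables.export('C:\\temp\\from_pdf.csv', f='csv', compress=False)

# Combine only the files exported above, not every CSV in the folder.
filepaths = glob.glob('C:\\temp\\from_pdf-page-*-table-*.csv')
df = pd.concat(map(pd.read_csv, filepaths))
df.to_excel('C:\\temp\\from_pdf.xlsx')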
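Because the exported names embed page and table numbers as text, a plain string sort places page 10 before page 2, and `glob.glob` does not guarantee any order at all. The sketch below orders paths numerically by page and then by table before they are concatenated. It assumes the `-page-<page>-table-<n>.csv` suffix seen in this issue. The helper name `order_by_page_and_table` and all sample paths other than the page 46 names are hypothetical.

```python
import re

# Assumed suffix of the exported names: -page-<page>-table-<n>.csv
NAME_RE = re.compile(r"-page-(\d+)-table-(\d+)\.csv$")


def order_by_page_and_table(paths):
    """Drop paths that do not match the export pattern and sort the rest
    numerically by (page, table)."""
    keyed = []
    for path in paths:
        match = NAME_RE.search(path)
        if match:
            keyed.append((int(match.group(1)), int(match.group(2)), path))
    return [path for _, _, path in sorted(keyed)]


paths = [
    "C:\\temp\\from_pdf-page-10-table-1.csv",  # hypothetical
    "C:\\temp\\from_pdf-page-2-table-1.csv",   # hypothetical
    "C:\\temp\\from_pdf-page-46-table-2.csv",
    "C:\\temp\\from_pdf-page-46-table-1.csv",
    "C:\\temp\\unrelated.csv",                 # hypothetical, filtered out
]
for path in order_by_page_and_table(paths):
    print(path)
```

The returned list could be passed to `pd.concat(map(pd.read_csv, ...))` in place of the raw `glob` result, so the rows in the workbook follow the page order of the PDF.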

Edit: I also tried flavor='lattice' and got this error:

error: C:\ci\opencv_1512688052760\work\modules\core\src\matrix.cpp:436: error: (-215) u != 0 in function cv::Mat::create

I don't have a c:\ci directory on my computer.