camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

not able to identify dataframe from bank statements pdf #357

Open arreyaar opened 1 year ago

arreyaar commented 1 year ago

Both tabula and camelot fail to extract tables from bank statement PDFs like the attached sample:

1) The table area is not fixed, i.e. the coordinates change with every month's statement.
2) Neither lattice nor stream mode works; both always return an empty DataFrame containing only the column names:

C:\Users\vikas\Desktop\GreenariaSociety\Tools>python sample.py
<class 'pandas.core.frame.DataFrame'>
Empty DataFrame
Columns: [DATE, MODE, PARTICULARS, DEPOSITS, WITHDRAWALS, BALANCE]
Index: []

3) When a column has multi-line values, e.g. PARTICULARS/DESCRIPTIONS in bank statements, the cell data is not extracted correctly and is spread across other cells/rows.

Sample code used:

import tabula
pdf_path = "sample_bank_statement.pdf"  # the attached sample
# Also tried lattice, pages='all', etc.
df = tabula.read_pdf(pdf_path, pages="1", stream=True, multiple_tables=True)[0]
print(type(df))
print(df)

sample_bank_statement.pdf

kdshreyas commented 12 months ago

Hello @arreyaar,

I tried extracting the same table using Camelot; the output is attached for your reference: sample_output.csv

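import camelot

# Raw string so the backslashes in the Windows path are not treated as escapes
inputpdf = r'D:\Personal\ope_source\sample_bank_statement.pdf'
# Stream flavor; a large edge_tol lets text edges extend further vertically
tables = camelot.read_pdf(inputpdf, pages='1', flavor='stream', edge_tol=500)
# Show the second detected table (index 1)
print(tables[1].df)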
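
Two of the points raised in the issue (the table position changing between statements, and multi-line PARTICULARS values spilling into extra rows) could possibly be handled after extraction with plain pandas. The following is only a rough sketch, not something Camelot or tabula provides: the helper names `find_statement_table` and `merge_continuation_rows` are hypothetical, the header names are taken from the output in the issue, and it assumes that a continuation line shows up as a row whose DATE cell is empty.

```python
import pandas as pd

# Column names taken from the empty DataFrame printed in the issue.
EXPECTED_HEADERS = {"DATE", "PARTICULARS", "BALANCE"}


def find_statement_table(frames, expected=EXPECTED_HEADERS):
    """Pick the first frame that contains a header row with all expected names.

    Searching by header text avoids hard-coding a table index or page
    coordinates. Returns the rows below the header, with the header row
    promoted to column names, or None if no frame matches.
    """
    for frame in frames:
        for pos, row in enumerate(frame.itertuples(index=False)):
            cells = [str(c).strip().upper() for c in row]
            if expected <= set(cells):
                body = frame.iloc[pos + 1:].reset_index(drop=True)
                body.columns = cells
                return body
    return None


def merge_continuation_rows(df, key="DATE", text="PARTICULARS"):
    """Fold rows with an empty `key` cell into the previous row's `text` cell.

    Assumption: a wrapped description line has no date of its own. A
    continuation row with nothing before it is kept unchanged.
    """
    merged = []
    for _, row in df.iterrows():
        record = row.to_dict()
        is_continuation = str(record[key]).strip() == ""
        if is_continuation and merged:
            extra = str(record[text]).strip()
            if extra:
                merged[-1][text] = f"{merged[-1][text]} {extra}".strip()
        else:
            merged.append(record)
    return pd.DataFrame(merged, columns=df.columns)
```

With Camelot this might be used as `find_statement_table([t.df for t in tables])`, followed by `merge_continuation_rows(...)` on the result. Other values in a continuation row (amounts, balance) are dropped by this sketch, so it only fits statements where wrapped lines carry description text alone.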