Open arreyaar opened 1 year ago
Hello @arreyaar,
I tried extracting the same table using Camelot. The output is attached for your reference: sample_output.csv
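import camelot

# Path to the sample statement attached to this issue
inputpdf = r'D:\Personal\ope_source\sample_bank_statement.pdf'
# Stream flavor is for tables without ruling lines; edge_tol extends text edges vertically during table detection
tables = camelot.read_pdf(inputpdf, pages='1', flavor='stream', edge_tol=500)
# tables[1] is the second table detected on the page
tables[1].df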
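Since the issue mentions that the table position changes from month to month, the hard-coded index `tables[1]` may not hold for other statements. One possible workaround is to pick the table by its header row instead of by position. The sketch below is only an illustration: the function name `find_statement_rows` is hypothetical, the header names are taken from the empty DataFrame output quoted in the issue, and it works on plain lists of rows (for a Camelot table these could be obtained with `table.df.values.tolist()`).

```python
# Header names as they appear in the output quoted in the issue.
EXPECTED = ["DATE", "MODE", "PARTICULARS", "DEPOSITS", "WITHDRAWALS", "BALANCE"]


def find_statement_rows(tables, expected=EXPECTED):
    """Return the rows below the first header row that matches `expected`.

    `tables` is a list of tables, each a list of rows (lists of cell values).
    Matching ignores case, surrounding whitespace, and empty cells.
    Returns None when no table contains the header row.
    """
    wanted = [name.upper() for name in expected]
    for rows in tables:
        for i, row in enumerate(rows):
            # Normalize the row and drop empty cells before comparing.
            cells = [str(cell).strip().upper() for cell in row]
            cells = [cell for cell in cells if cell]
            if cells == wanted:
                return rows[i + 1:]
    return None
```

With Camelot this might be called as `find_statement_rows([t.df.values.tolist() for t in tables])`. Whether the header row survives extraction intact depends on the PDF, so treat this as a starting point.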
tabula and camelot are both unable to extract tables from bank statement PDFs like the attached sample:
1) The table area is not fixed, i.e. its coordinates change with every month's statement.
2) Neither lattice nor stream mode works; both always return an empty DataFrame containing only the column names:
C:\Users\vikas\Desktop\GreenariaSociety\Tools>python sample.py
<class 'pandas.core.frame.DataFrame'>
Empty DataFrame
Columns: [DATE, MODE, PARTICULARS, DEPOSITS, WITHDRAWALS, BALANCE]
Index: []
3) When a column has values that span multiple lines, e.g. PARTICULARS/DESCRIPTIONS in bank statements, the cell data is not extracted correctly and is spread across other cells/rows.
Sample code used:
import tabula
# pdf_path is the path to the statement PDF; also tried lattice, pages='all', etc.
df = tabula.read_pdf(pdf_path, pages="1", stream=True, multiple_tables=True)[0]
print(type(df))
print(df)
sample_bank_statement.pdf
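For the third problem (multi-line PARTICULARS values spread across rows), one common post-processing approach is to merge continuation rows back into the transaction they belong to. The sketch below assumes that every real transaction row has a non-empty DATE cell and that a row with an empty DATE is the wrapped remainder of the previous row; that assumption may not hold for every bank's layout. The function name `merge_wrapped_rows` and the sample values are hypothetical and only for illustration.

```python
def merge_wrapped_rows(rows, key_col=0):
    """Fold continuation rows into the preceding transaction row.

    `rows` is a list of equally long rows (lists of cell values).
    Assumption: a row whose `key_col` cell (DATE) is empty continues the
    previous row, so its non-empty cells are appended to that row's cells.
    """
    merged = []
    for row in rows:
        cells = [str(cell).strip() for cell in row]
        if cells[key_col] or not merged:
            # A new transaction starts here (or there is nothing to merge into).
            merged.append(cells)
        else:
            previous = merged[-1]
            for j, cell in enumerate(cells):
                if cell:
                    previous[j] = (previous[j] + " " + cell).strip()
    return merged


# Illustrative rows only, not taken from the attached statement.
rows = [
    ["01-01-2020", "NEFT", "PAYMENT TO", "", "500.00", "1500.00"],
    ["", "", "ACME LTD", "", "", ""],
    ["02-01-2020", "", "CASH DEP", "200.00", "", "1700.00"],
]
cleaned = merge_wrapped_rows(rows)
```

The rows of an extracted DataFrame can be passed in via `df.values.tolist()`, and the result can be wrapped again with `pandas.DataFrame(cleaned, columns=...)`.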