atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Unable to get same format of result from two bank statements of the same bank #263

Closed pksingh210 closed 5 years ago

pksingh210 commented 5 years ago

Hi Vinayak, I am trying to process bank statements to extract their transaction tables. I am using CBA bank statements from two users:

  1. edited-CBASavings.pdf
  2. edited-CBASavings-S.pdf

edited-CBASavings.pdf is processed correctly with the following code:

tables = camelot.read_pdf(filePath, flavor='stream', split_text=True)

But the other bank statement, edited-CBASavings-S.pdf, produces a bad result: data from two columns is mixed into a single column of the CSV file. Please see the result below. I would like to have a single piece of code, at least per bank, that processes all users' bank statements.

image

Could you help fix this issue? editecd-CBASavings-S.pdf edited_CBASavings.pdf

vinayak-mehta commented 5 years ago

@pksingh210 I'll try to debug it this week.

I see that the table follows the same structure across all PDFs, so you should be able to specify columns and use the same value for all PDFs.
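As a rough sketch of what specifying columns could look like: Camelot's `read_pdf` accepts a `columns` argument for the `stream` flavor, given as a list of strings of comma-separated x-coordinates. The helper names `columns_arg` and `read_statement` and the example coordinates are hypothetical; the real separator positions would have to be measured from the statement layout. Note that `pages` defaults to `'1'`, so multi-page statements need an explicit page range or `'all'`.

```python
def columns_arg(x_coords):
    """Format x-coordinates as the list-of-one-string that Camelot's
    `columns` parameter expects, e.g. [72, 209.5] -> ['72,209.5']."""
    return [",".join(str(x) for x in sorted(x_coords))]


def read_statement(path, x_coords, pages="1"):
    """Read a statement with fixed column separators (hypothetical helper).

    The same x_coords can be reused for every PDF that shares the layout.
    """
    import camelot  # imported lazily so the helper above works without it

    return camelot.read_pdf(
        path,
        flavor="stream",
        pages=pages,                     # e.g. "all" or "1-3" for multi-page PDFs
        columns=columns_arg(x_coords),   # explicit separators instead of guessing
        split_text=True,                 # split text that spans a separator
    )


# Example call with made-up coordinates (assumption, not measured values):
# tables = read_statement("edited-CBASavings-S.pdf", [72, 209.5, 327], pages="all")
```

Reusing one coordinate list only works if every statement really shares the same layout; if the layouts differ per account, one list per layout would be needed.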

pksingh210 commented 5 years ago

Hi Vinayak,

Not all bank PDFs follow the same structure. Even statements from the same bank have different structures for different accounts. But the challenge I am facing is that, for the same bank and the same type of account, one customer's statement is processed well while another customer's is not.

For some bank statements multiple pages are processed, but for others they are not. It looks like there are some bugs in the solution. Could you look into that?

pksingh210 commented 5 years ago

What technique/tool do you use to get the x-coordinates of columns in PDF tables?

vinayak-mehta commented 5 years ago

What technique/tool do you use to get the x-coordinates of columns in PDF tables?

Columns are guessed based on these methods.
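To give a feel for how column guessing from text positions can work, here is a simplified illustration. It is not Camelot's actual implementation; it only assumes that each word comes with a horizontal extent `(x0, x1)`, merges overlapping extents into column ranges, and places a separator in the middle of each gap. The function names and the `tol` parameter are hypothetical.

```python
def guess_column_ranges(word_spans, tol=0.0):
    """Merge overlapping (x0, x1) word extents into column ranges.

    Two extents are merged when the next one starts within `tol`
    of the end of the current range.
    """
    merged = []
    for x0, x1 in sorted(word_spans):
        if merged and x0 <= merged[-1][1] + tol:
            prev_x0, prev_x1 = merged[-1]
            merged[-1] = (prev_x0, max(prev_x1, x1))
        else:
            merged.append((x0, x1))
    return merged


def separators(ranges):
    """Return the midpoint of each gap between adjacent column ranges."""
    return [(a[1] + b[0]) / 2 for a, b in zip(ranges, ranges[1:])]


spans = [(50, 90), (55, 100), (200, 260), (210, 240), (400, 450)]
ranges = guess_column_ranges(spans)
print(ranges)              # [(50, 100), (200, 260), (400, 450)]
print(separators(ranges))  # [150.0, 330.0]
```

A heuristic like this merges two columns into one whenever a long cell value reaches into the neighbouring column, which is one way the kind of mixed-column output described in this issue can arise.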

vinayak-mehta commented 5 years ago

Closing this, please reopen if your issue wasn't fixed.