camelot-dev / excalibur

A web interface to extract tabular data from PDFs
https://excalibur-py.readthedocs.io
MIT License

Stream does not detect similar tables in the same document #156

Open daniambrosio opened 2 years ago

daniambrosio commented 2 years ago

Using Lattice on this bank statement PDF finds no tables. I thought setting background = True for Lattice would help, but it did not. So I tried Stream, which works for some pages. On others the output gets messy: the text from the second column is returned inside the first column (the one with the dates).
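One thing that may be worth trying for the pages where columns get merged is giving Stream explicit column separators. The sketch below assumes the extraction is run through Camelot (the library behind Excalibur) and uses its `columns` and `split_text` options for the Stream flavor. The helper names `stream_kwargs` and `extract`, and any coordinates passed to them, are hypothetical; real x-coordinates would have to be read off the statement's layout.

```python
def stream_kwargs(column_separators):
    """Build keyword arguments for a Stream run with explicit column separators.

    column_separators: x-coordinates (in PDF points) of the column boundaries.
    These values are placeholders, not measured from the actual statement.
    """
    return {
        "flavor": "stream",
        # Camelot takes a list of strings of comma-separated x-coordinates.
        "columns": [",".join(str(x) for x in column_separators)],
        # Ask Camelot to split text that spans a separator across columns.
        "split_text": True,
    }


def extract(pdf_path, pages, column_separators):
    """Run Camelot's Stream parser with the separators above (hypothetical helper)."""
    # Imported lazily so the kwargs helper can be used without Camelot installed.
    import camelot

    return camelot.read_pdf(pdf_path, pages=pages, **stream_kwargs(column_separators))
```

A call might look like `extract("statement.pdf", "1-end", [90, 400, 480])`, where the file name and the three separators are made up for illustration. Whether this fixes the messy pages depends on the columns sitting at the same x-positions on every page.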

This one works fine (print of the PDF)

image

Corresponding print of the extracted tables:

image

This one gets messy:

image

Corresponding print of the extracted tables:

image

daniambrosio commented 2 years ago

Could anyone suggest an approach for this case?