camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.97k stars, 471 forks

Misdetection of columns with narrow gap between them #341

Open saidakyuz opened 1 year ago

saidakyuz commented 1 year ago

When I tried to extract the table shown in the screenshot from the PDF (downloadable from the link below), the resulting table object had its first and second columns merged into one.

Steps to reproduce the bug

Run the code below after installing Camelot and its dependencies (Ghostscript, etc.).

Expected behavior

I expected the table to be extracted correctly, with the first and second columns kept separate.

Here is the code I used to extract:

tables, layout, dim = self.extract_tables(linescale=30, flag_size=True)
acc_tables = self.filter_acc_tables(tables=tables, min_accuracy=85, max_whitespace=30)
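
The helper methods called above are not shown in this issue. As a rough illustration only, a filtering step like the second line might look like the sketch below. It assumes each table exposes a `parsing_report` dictionary with `accuracy` and `whitespace` entries, as Camelot tables do; the function name `filter_tables`, the comparison directions, and the stand-in objects are hypothetical and not taken from the reporter's code.

```python
from types import SimpleNamespace


def filter_tables(tables, min_accuracy=85, max_whitespace=30):
    """Keep tables whose parsing report meets both thresholds.

    Hypothetical helper: assumes each table has a `parsing_report`
    dict with numeric "accuracy" and "whitespace" values (percentages).
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            kept.append(table)
    return kept


# Stand-in objects so the sketch runs without a PDF; real input would be
# the tables returned by Camelot.
demo_tables = [
    SimpleNamespace(name="a", parsing_report={"accuracy": 99.0, "whitespace": 10.0}),
    SimpleNamespace(name="b", parsing_report={"accuracy": 70.0, "whitespace": 10.0}),
    SimpleNamespace(name="c", parsing_report={"accuracy": 95.0, "whitespace": 45.0}),
]
kept_names = [t.name for t in filter_tables(demo_tables)]
print(kept_names)
```

Note that a table with merged columns can still pass such a filter, since the accuracy score does not by itself reveal a missed column boundary.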

PDF

Screenshots

(two images)

Environment

Link for PDF

https://www.irf.com/product-info/datasheets/data/irhm9150.pdf

saidakyuz commented 1 year ago

From what I have observed, this happens only when a cell is spanned both vertically and horizontally and the same column also contains cells that are not spanned horizontally. In that situation, cells in the same row can end up with opposite values of vspan (True or False). The issue is caused by this attribute, but I do not have a solution for it yet. For example, the following tables show the issue: (three images of affected tables)