camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.96k stars, 466 forks

Camelot returns tables that contain no text (where text should be detectable) #337

Open peletiah opened 1 year ago

peletiah commented 1 year ago

I'm trying to extract data from about 900 certificates. These certificates have an identical visual structure, but are published by different parties. For the majority of files the extraction works. However, for several dozen files, the table structure returned by Camelot contains only empty strings.

Plotting the grid and the text shows that content is detected (e.g. Table 7 in DS_3663.pdf):

[Image: DS_3663_table_7_grid]

[Image: DS_3663_table_7_text]

I'm using this command to read the PDF and create the tables:

>>> tables = camelot.read_pdf('pdfs/DS_3663.pdf', pages='1-end', line_scale=110, shift_text=[''])

e.g. Table 7 contains this data: >>> tables[7].data [['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', '']]
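With roughly 900 files, it may help to flag the affected tables automatically rather than inspecting each one. The following is a minimal sketch, assuming each table is available as a list of rows of strings (the shape of `tables[7].data` shown above). `is_blank_table` and `blank_table_indices` are hypothetical helper names, not part of Camelot, and the sample data is made up.

```python
def is_blank_table(data):
    """Return True if every cell is empty or whitespace-only."""
    return all(not cell.strip() for row in data for cell in row)


def blank_table_indices(tables_data):
    """Return the indices of the tables whose cells are all blank."""
    return [i for i, data in enumerate(tables_data) if is_blank_table(data)]


# Illustrative data only: one blank table and one with content.
blank = [["", "", ""], ["", " ", ""]]
filled = [["Name", "Value"], ["a", "1"]]
print(blank_table_indices([blank, filled]))  # prints [0]
```

With Camelot, the input could presumably be built as `[t.data for t in tables]`, which would make it easy to list the affected files in one pass.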

Here are a few more example PDFs where the extraction fails in the same way: DS_885.pdf, DS_2481.pdf, DS_2083.pdf

Parsing all of these files with pdf2txt.py successfully extracts text, so I assume it should be possible to get a result with Camelot as well.

Environment

I've tried debugging this, but had difficulty understanding the intricate code in the bbox sections. From what I've figured out, it appears that Camelot is unable to match the horizontal_text objects (which contain the relevant text) to the line grid.
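To make the suspected failure mode concrete, here is a simplified sketch of matching text boxes to grid cells by coordinates. It is not Camelot's actual implementation: the function names, the center-point rule, and the coordinates are all assumptions for illustration. It shows how text that is detected, but whose coordinates fall outside the detected cells (for example because of an offset between the two coordinate systems), would leave every cell empty.

```python
def center(bbox):
    """Return the center point of a bbox given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(cell, point):
    """Return True if the point lies inside the cell bbox (edges included)."""
    x0, y0, x1, y1 = cell
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def fill_cells(cells, text_boxes):
    """Assign each text to the first cell containing its center.

    cells: list of (x0, y0, x1, y1)
    text_boxes: list of ((x0, y0, x1, y1), text)
    Returns one string per cell; text that matches no cell is dropped.
    """
    contents = [""] * len(cells)
    for bbox, text in text_boxes:
        point = center(bbox)
        for i, cell in enumerate(cells):
            if contains(cell, point):
                contents[i] = (contents[i] + " " + text).strip()
                break
    return contents


# Two adjacent cells and two text boxes (made-up coordinates).
cells = [(0, 0, 100, 20), (100, 0, 200, 20)]
texts = [((10, 5, 60, 15), "Batch"), ((110, 5, 150, 15), "42")]

aligned = fill_cells(cells, texts)

# The same text shifted 500 units up: still "detected", but outside the grid.
shifted = [((x0, y0 + 500, x1, y1 + 500), t) for (x0, y0, x1, y1), t in texts]
misaligned = fill_cells(cells, shifted)

print(aligned)     # ['Batch', '42']
print(misaligned)  # ['', '']
```

If something like this is happening, comparing the raw coordinates of the text objects with those of the table cells for a working and a failing file might show where the two diverge.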