I'm trying to extract data from about 900 certificates. These certificates share an identical visual structure but are published by different parties. For the majority of files the extraction works. However, for several dozen files, the table structure returned by Camelot contains only empty strings.
Plotting the grid and the text shows that the content is detected (e.g. Table 7 in DS_3663.pdf).
I'm using this command to read the pdf and create the tables:
>>> tables=camelot.read_pdf('pdfs/DS_3663.pdf', pages='1-end', line_scale=110, shift_text=[''])
Parsing all of these files with pdf2txt.py successfully extracts text, so I assume it should be possible to get a result with Camelot as well.
Environment
OS: Ubuntu 22.04.1 LTS
Python version: 3.10.6
Numpy version: 1.23.4
OpenCV version: 4.6.0.66
Ghostscript version: 9.55.0
Camelot version: 0.9.0
I've tried debugging this, but had difficulty following the intricate code in the bbox sections. From what I can tell, Camelot is unable to match horizontal_text (which contains the relevant text) with the line grid.
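To narrow down where the matching goes wrong, a small standalone check like the one below might help. It is only a sketch and not Camelot's own assignment logic: it takes plain `(x0, y0, x1, y1)` tuples for grid cells and text boxes and reports which text boxes have a center that lies in no cell. The function names are hypothetical, and how the boxes are pulled out of Camelot is left open here.

```python
def center(box):
    """Return the center point of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def contains(cell, point):
    """Return True if the point lies inside the cell (edges included)."""
    x0, y0, x1, y1 = cell
    px, py = point
    # min/max makes the check independent of corner ordering
    return min(x0, x1) <= px <= max(x0, x1) and min(y0, y1) <= py <= max(y0, y1)


def unmatched_text_boxes(cells, text_boxes):
    """Return the text boxes whose center falls inside none of the cells."""
    return [
        t for t in text_boxes
        if not any(contains(c, center(t)) for c in cells)
    ]


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only
    cells = [(0, 0, 10, 10), (10, 0, 20, 10)]
    text_boxes = [(1, 1, 3, 3), (50, 50, 55, 55)]
    print(unmatched_text_boxes(cells, text_boxes))
```

If nearly every text box comes back as unmatched for the failing files, that would point to the text and the grid living in different coordinate spaces (for example an offset or a scale difference) rather than to missing text.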
For example, Table 7 of DS_3663.pdf contains this data:
>>> tables[7].data
[['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', '']]
Here are a few more example PDFs where the extraction fails in the same way: DS_885.pdf, DS_2481.pdf, DS_2083.pdf
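To find every affected table across the full set of files, a check along these lines could be used. This is a minimal sketch with hypothetical helper names. It works on plain lists of rows shaped like the `tables[7].data` output above, so it does not depend on Camelot itself.

```python
def is_blank_table(data):
    """Return True if every cell is an empty or whitespace-only string.

    A table with no rows at all is also treated as blank.
    """
    return all(not cell.strip() for row in data for cell in row)


def blank_table_indices(tables_data):
    """Return the indices of all blank tables in a list of table data."""
    return [i for i, data in enumerate(tables_data) if is_blank_table(data)]


if __name__ == "__main__":
    # Hypothetical sample: one filled table followed by two blank ones
    sample = [
        [["a", "b"], ["c", "d"]],
        [["", ""], ["", ""]],
        [["", " "]],
    ]
    print(blank_table_indices(sample))
```

With Camelot, the input would presumably be built as `[t.data for t in tables]`, using the same `.data` attribute shown above.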