atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

some tables are skipped #288

Closed retsyo closed 5 years ago

retsyo commented 5 years ago

The PDF file in question was exported from a DOCX file by WPS.

The following code reports only 2 tables on page 2. In the exported foo.xlsx, Tab 2-1 and Tab 2-2 are missing.

import camelot
tables = camelot.read_pdf(r'source.pdf', pages='2')
print(tables)

tables.export('foo.xlsx', f='excel') # json, excel, html, sqlite

anakin87 commented 5 years ago

When I try to execute tables = camelot.read_pdf(r'source.pdf', pages='2'), I get the following error: PdfReadError: EOF marker not found
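A PDF file normally ends with a `%%EOF` marker, so one quick way to see whether a local copy of the file carries it is to inspect the tail of the file. The sketch below is only a diagnostic aid, not something from this thread: the helper name `has_eof_marker` and the 1024-byte window are assumptions made for illustration. A stricter parser may still expect the marker at the very end of the file.

```python
from pathlib import Path


def has_eof_marker(path, tail_bytes=1024):
    """Return True if b'%%EOF' appears in the last `tail_bytes` bytes of the file.

    `has_eof_marker` and `tail_bytes` are hypothetical names used for illustration.
    """
    data = Path(path).read_bytes()
    # Slicing with a negative index is safe even when the file is shorter
    # than `tail_bytes`; it simply returns the whole content.
    return b"%%EOF" in data[-tail_bytes:]


# Example usage (assumes a local file named source.pdf):
# print(has_eof_marker("source.pdf"))
```

If this returns False for a downloaded copy, the download may be incomplete; if it returns True, the error more likely comes from how the parser locates the marker.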

retsyo commented 5 years ago

The latest gswin64.exe in Ghostscript 9.26, SumatraPDF, Foxit Reader and many other PDF readers can open the PDF file and render all 3 pages without any problem.

I am using the latest cloned Camelot with

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32

on Windows 7 64-bit.

vinayak-mehta commented 5 years ago

@retsyo Since the tables have short lines, you need to pass in line_scale (see https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines). I was able to get the first two tables out with a value of 40.
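A minimal sketch of that suggestion, assuming the same source.pdf and page as in the original report. `line_scale` is a keyword argument that `camelot.read_pdf` forwards to the Lattice parser; the wrapper name `read_lattice_tables` and its `reader` parameter are hypothetical and exist only so the call can be tried without Camelot installed.

```python
def read_lattice_tables(path, pages="2", line_scale=40, reader=None):
    """Run Camelot's Lattice parser with a custom line_scale.

    `reader` is a hypothetical hook for illustration; when it is None,
    camelot.read_pdf is used.
    """
    if reader is None:
        # Imported lazily so the helper can be defined without Camelot installed.
        import camelot
        reader = camelot.read_pdf
    # A larger line_scale lets the Lattice parser pick up shorter lines.
    return reader(path, pages=pages, flavor="lattice", line_scale=line_scale)


# Example usage (assumes Camelot is installed and source.pdf exists):
# tables = read_lattice_tables("source.pdf", pages="2", line_scale=40)
# print(tables)
# tables.export("foo.xlsx", f="excel")
```

The value 40 is the one reported to work for this file; other documents may need a different setting.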

retsyo commented 5 years ago

Thanks. In general, does a larger line_scale produce wrong tables? In other words, if there are many tables spread across a large number of PDF files, so that checking the extracted results against the original PDFs is hard work, is it safe to always use a very large line_scale to extract all the tables? What would we lose?

vinayak-mehta commented 5 years ago

Warning: Making line_scale very large (>150) will lead to text getting detected as lines.

https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines
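Rather than always using one very large value, one option is to try a few candidate values below the 150 threshold from the warning and compare how many tables each one finds. The sketch below is an assumption-laden illustration, not a Camelot feature: `count_tables_by_line_scale`, `extract`, `scales` and `max_scale` are hypothetical names, and the candidate values are arbitrary.

```python
def count_tables_by_line_scale(extract, scales=(15, 40, 80, 120), max_scale=150):
    """Map each candidate line_scale to the number of tables it yields.

    `extract` is any callable that takes a line_scale and returns a sized
    collection of tables. Candidates above `max_scale` are skipped, in line
    with the warning that very large values detect text as lines.
    """
    counts = {}
    for scale in scales:
        if scale > max_scale:
            continue  # Skip values in the range the warning advises against.
        counts[scale] = len(extract(scale))
    return counts


# Example usage (assumes Camelot is installed and source.pdf exists):
# import camelot
# counts = count_tables_by_line_scale(
#     lambda s: camelot.read_pdf("source.pdf", pages="2", line_scale=s)
# )
# print(counts)
```

A change in the count between two values does not say which result is right: it may mean short-lined tables were recovered, or that text was picked up as lines, so a spot check against the original PDF is still needed.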