atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Extracting tables on large PDFs #337

Closed maximeboun closed 5 years ago

maximeboun commented 5 years ago

Hello everyone,

I'm trying to extract tables from a 320-page PDF with the following code:

import camelot

results = camelot.read_pdf('WebScraper/14572.pdf', pages='all', flavor='stream')

However, after a long wait I get an error:

UserWarning: No tables found in table area 1 [stream.py:346]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\mboun\PycharmProjects\WebProject_WebScraper\venv\lib\site-packages\camelot\io.py", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "C:\Users\mboun\PycharmProjects\WebProject_WebScraper\venv\lib\site-packages\camelot\handlers.py", line 161, in parse
    layout_kwargs=layout_kwargs)
  File "C:\Users\mboun\PycharmProjects\WebProject_WebScraper\venv\lib\site-packages\camelot\parsers\stream.py", line 425, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "C:\Users\mboun\PycharmProjects\WebProject_WebScraper\venv\lib\site-packages\camelot\parsers\stream.py", line 334, in _generate_columns_and_rows
    ncols = max(set(elements), key=elements.count)
ValueError: max() arg is an empty sequence
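
The last frame of the traceback can be reproduced outside Camelot. The sketch below is not Camelot's code: it copies only the expression shown on line 334 and assumes `elements` is a plain list (which the `.count` call suggests). The helper names `most_common` and `most_common_or_none` are hypothetical. It illustrates that `max()` raises this exact `ValueError` whenever the list is empty, which would be consistent with a page on which no text rows were found.

```python
def most_common(elements):
    """Return the most frequent value, using the expression from the traceback."""
    return max(set(elements), key=elements.count)


def most_common_or_none(elements):
    """Same as most_common, but return None instead of raising on an empty list."""
    if not elements:
        return None
    return most_common(elements)


# A non-empty list works as expected.
print(most_common([3, 3, 2]))

# An empty list triggers the error seen in the traceback.
try:
    most_common([])
except ValueError as exc:
    print("ValueError:", exc)

# The guarded variant returns None instead.
print(most_common_or_none([]))
```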

The UserWarning appears about halfway through processing. The PDF is attached: 14572.pdf

Has anyone experienced a similar problem, or does anyone have a clue how to solve this?

vinayak-mehta commented 5 years ago

@maximeboun Were you able to solve this?

bourliam commented 4 years ago

I have the same issue. Is there any way to solve it?
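
One possible workaround, offered as a sketch rather than a confirmed fix: instead of `pages='all'`, read one page at a time and skip the pages that raise the `ValueError`, so a single problematic page does not abort the whole run. The helper `extract_per_page` and its `read_page` parameter are hypothetical names. The reader is passed in as an argument so the function can be tried without Camelot installed. The only Camelot call assumed is `camelot.read_pdf(path, pages=..., flavor=...)`, as used in the original post.

```python
def extract_per_page(path, pages, read_page, **kwargs):
    """Read each page separately and collect tables plus the pages that failed.

    read_page is any callable with the shape read_page(path, pages="N", **kwargs)
    that returns an iterable of tables, for example camelot.read_pdf.
    """
    tables = []
    failed = []
    for page in pages:
        try:
            # Camelot expects the page selection as a string, e.g. "17".
            tables.extend(read_page(path, pages=str(page), **kwargs))
        except ValueError as exc:
            # Record the page and the message, then continue with the next page.
            failed.append((page, str(exc)))
    return tables, failed


# Usage with Camelot (assumes camelot is installed and the PDF has 320 pages):
#
#   import camelot
#   tables, failed = extract_per_page(
#       'WebScraper/14572.pdf', range(1, 321), camelot.read_pdf, flavor='stream'
#   )
#   print(len(tables), "tables extracted;", "failed pages:", failed)
```

The `failed` list shows which pages trigger the error, which may help narrow down what is unusual about them. Opening the file once per page will likely be slower than a single call, so this is better suited to diagnosis than to routine use.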