atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

IndexError: list index out of range #357

Closed heixincai closed 4 years ago

heixincai commented 5 years ago

When I try to read this PDF, I get the error below, and I don't know how to solve it. Thanks. AreaPercent-lc-multPageTable1.pdf

D:\LES\Environment\Python\python.exe D:/LES/Python/Code/ReadNoBorderTable.py
Traceback (most recent call last):
  File "D:/LES/Python/Code/ReadNoBorderTable.py", line 7, in <module>
    tables = camelot.read_pdf(pdfPath,flavor=tableType,strip_text=' .\n',columns=['58,107,139,189,327,258'],split_text=True)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\io.py", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\handlers.py", line 162, in parse
    layout_kwargs=layout_kwargs)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\parsers\stream.py", line 425, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\parsers\stream.py", line 321, in _generate_columns_and_rows
    if self.columns is not None and self.columns[table_idx] != "":
IndexError: list index out of range

Process finished with exit code 1

heixincai commented 5 years ago

When I use Excalibur, I find that Autodetect Tables also detects an extra small table (see the picture). If I remove the small table, Camelot can extract the tables. Maybe this is the same problem. This is everything I know. Thanks~

pachacamac commented 4 years ago

Pretty sure I have the same problem.

Running with flavor='stream', columns=['62,105,185,252'] and get

  File "/home/user/.local/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
    **kwargs
  File "/home/user/.local/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
    p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
  File "/home/user/.local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 458, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "/home/user/.local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 336, in _generate_columns_and_rows
    if self.columns is not None and self.columns[table_idx] != "":
IndexError: list index out of range

Any way to just ignore tables that don't fall in the provided columns and move on? I get this on a lot of PDFs but the PDFs all have slightly different stuff surrounding the table I'm interested in. So I would just like to move on without failing and later filter the tables that fit my scheme.

PS: I already limited to the right pages but can't / don't know how to give the stream parser a concrete starting and ending point.

pachacamac commented 4 years ago

Somewhat dirty workaround that works for my case:

import camelot

# Workaround: repeat the same column set so that every table camelot discovers has an entry,
# e.g. ['62,105,185,252', '62,105,185,252', ...]
cols = ['62,105,185,252'] * 128
tables = camelot.read_pdf(pdf_file, flavor='stream', columns=cols)  # pdf_file: path to the PDF

@heixincai let me know if it helps you too :)
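
For anyone wondering why repeating the column string helps: both tracebacks end at the expression `self.columns[table_idx]`, which suggests the stream parser looks up one entry of `columns` per detected table. The snippet below is only a stand-in that mimics that lookup in plain Python (the helper `lookup_columns` is hypothetical and not part of camelot), assuming two tables are detected on the page.

```python
def lookup_columns(columns, n_tables):
    """Mimic the per-table lookup `columns[table_idx]` seen in the traceback."""
    return [columns[table_idx] for table_idx in range(n_tables)]


cols = ["62,105,185,252"]

# One separator string but two detected tables: the second lookup fails.
try:
    lookup_columns(cols, 2)
    failed = False
except IndexError:
    failed = True

# Repeating the same string gives every detected table its own entry.
padded = cols * 128
per_table = lookup_columns(padded, 2)
```

The factor 128 is arbitrary; it only needs to be at least the number of tables detected on a page.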

vinayak-mehta commented 4 years ago

@heixincai @pachacamac If you know the approximate location of the table in your PDF (assuming the table always lies in this general area in all PDFs that you have), you can specify table_regions to make camelot look for tables in only these regions.
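
A minimal sketch of that suggestion, assuming the documented form of `table_regions` (a list of `"x1,y1,x2,y2"` strings, top-left then bottom-right, in PDF coordinate space). The coordinates, the page number, and the helper names `region_string` and `read_tables_in_region` are made up for illustration; the real values depend on your PDF.

```python
def region_string(x1, y1, x2, y2):
    """Format a region as 'x1,y1,x2,y2' (top-left, bottom-right) in PDF coordinates."""
    return ",".join(str(v) for v in (x1, y1, x2, y2))


def read_tables_in_region(pdf_path, region, pages="1"):
    """Run the stream parser only inside the given region of the page(s)."""
    import camelot  # imported lazily so region_string works without camelot installed

    return camelot.read_pdf(
        pdf_path, pages=pages, flavor="stream", table_regions=[region]
    )


# Placeholder coordinates: a box covering most of a typical portrait page.
region = region_string(50, 750, 550, 100)
```

Calling `read_tables_in_region("your.pdf", region)` would then restrict detection to that box.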

pachacamac commented 4 years ago

@vinayak-mehta for me the problem is that I have PDFs where I'm interested in tables by structure (same columns etc) but different height, y-position, etc. on multiple pages (unknown number of pages)

vinayak-mehta commented 4 years ago

Perhaps, we can put in another filter to weed out tables which do not have a certain width/height as a parameter inside the library.
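
A similar filter can also be applied outside the library, after extraction. The sketch below keeps only tables with an expected number of columns, assuming each extracted table exposes a `shape` attribute of the form `(rows, columns)`. The helper name `tables_with_column_count` is hypothetical, and the stand-in objects replace real camelot tables so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def tables_with_column_count(tables, n_columns):
    """Keep only tables whose shape reports the expected number of columns."""
    return [t for t in tables if t.shape[1] == n_columns]


# Stand-ins for extracted tables; real ones would come from camelot.read_pdf(...).
fake_tables = [
    SimpleNamespace(shape=(3, 2)),   # small stray table near the one of interest
    SimpleNamespace(shape=(40, 5)),
    SimpleNamespace(shape=(12, 5)),
]

kept = tables_with_column_count(fake_tables, 5)
```

Filtering on height or position would follow the same pattern with a different predicate.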

heixincai commented 4 years ago

Sorry, I have been busy with my project these days. My solution is to get the location information for the entire PDF page and then filter the parsed data. Maybe my method only works for my case, but the problem has been solved. Thanks~

pachacamac commented 4 years ago

Has this been fixed now?

vinayak-mehta commented 4 years ago

@pachacamac Opened it here https://github.com/camelot-dev/camelot/issues/50

helpgodsg commented 9 months ago

@pachacamac Great. It really works.