atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

read_pdf fails with an index out of range #306

Closed sweco-sekrsv closed 4 years ago

sweco-sekrsv commented 5 years ago

Camelot 0.7.2. The PDF file is attached.

Running this: tables = camelot.read_pdf('3713-B31-24-04401_error.pdf', flavor='lattice', line_scale=30)

results in this error:

File "camelot_test04.py", line 205, in <module>
    tables = camelot.read_pdf(filename,flavor='lattice', line_scale=30)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\camelot\io.py", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\camelot\handlers.py", line 156, in parse
    self._save_page(self.filepath, p, tempdir)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\camelot\handlers.py", line 109, in _save_page
    layout, dim = get_page_layout(fpath)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\camelot\utils.py", line 689, in get_page_layout
    interpreter.process_page(page)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\pdfinterp.py", line 862, in render_contents
    self.init_resources(resources)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\pdfinterp.py", line 362, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\pdfinterp.py", line 197, in get_font
    font = PDFTrueTypeFont(self, spec)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\pdffont.py", line 594, in __init__
    PDFSimpleFont.__init__(self, descriptor, widths, spec)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\pdffont.py", line 560, in __init__
    CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\cmapdb.py", line 287, in run
    self.nextobject()
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\psparser.py", line 616, in nextobject
    self.do_keyword(pos, token)
  File "C:\Users\seks13473\AppData\Local\Programs\Python\Python36\lib\site-packages\pdfminer\cmapdb.py", line 393, in do_keyword
    self.cmap.add_cid2unichr(s1+i, code[i])
IndexError: list index out of range

Any ideas?

Attachment: 3713-B31-24-04401_error.pdf

vinayak-mehta commented 5 years ago

Looks like a pdfminer bug. Let me try to reproduce it.
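For context, the last frame of the traceback (`self.cmap.add_cid2unichr(s1+i, code[i])`) suggests a loop that walks a character-code range and indexes into a list shorter than that range. This likely happens while parsing a font's Unicode mapping. The sketch below is a simplified, hypothetical stand-in for that pattern, not pdfminer's actual code. The function names `expand_range` and `expand_range_tolerant` are invented for illustration.

```python
def expand_range(start, end, codes):
    """Map every code in [start, end] to the matching entry of `codes`.

    Hypothetical model of the failing loop: it trusts that `codes`
    has one entry per code in the range.
    """
    mapping = {}
    for i in range(end - start + 1):
        # Raises IndexError when `codes` has fewer entries than the range.
        mapping[start + i] = codes[i]
    return mapping


def expand_range_tolerant(start, end, codes):
    """Same expansion, but stop at whichever is shorter: range or list."""
    count = min(end - start + 1, len(codes))
    return {start + i: codes[i] for i in range(count)}


if __name__ == "__main__":
    # A range of three codes with only two mapped values.
    print(expand_range_tolerant(1, 3, ["a", "b"]))
    try:
        expand_range(1, 3, ["a", "b"])
    except IndexError as exc:
        print("IndexError:", exc)
```

If this reading is right, the PDFs that fail would contain a mapping whose declared range is longer than its list of values, and a tolerant parser would map only the entries that exist.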

sweco-sekrsv commented 5 years ago

Thanks! I have a few more PDF files that trigger this error. I can provide them if that helps.

vinayak-mehta commented 5 years ago

Yes, please do.

sweco-sekrsv commented 5 years ago

I'm attaching another 10 files that seem to have the same problem: list_index_bug_pdfs.zip
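To confirm which files in a batch hit the same `IndexError`, a small triage loop can help. This is only a sketch. The helper `triage` is hypothetical, and it takes the reader as a callable so it does not depend on any particular Camelot version. The directory name in the usage comment assumes the zip was extracted next to the script.

```python
def triage(paths, reader):
    """Run `reader` on each path and separate successes from IndexError failures.

    `reader` is any callable taking a file path, e.g. a wrapper around
    camelot.read_pdf. Other exception types are left to propagate.
    """
    ok, failed = [], []
    for path in paths:
        try:
            reader(str(path))
        except IndexError as exc:
            failed.append((str(path), str(exc)))
        else:
            ok.append(str(path))
    return ok, failed


# Possible usage (assumes camelot is installed and the zip is extracted
# to ./list_index_bug_pdfs):
#
#   import camelot
#   from pathlib import Path
#
#   ok, failed = triage(
#       sorted(Path("list_index_bug_pdfs").glob("*.pdf")),
#       lambda p: camelot.read_pdf(p, flavor="lattice", line_scale=30),
#   )
#   for path, message in failed:
#       print(path, "->", message)
```

Catching only `IndexError` keeps unrelated failures (missing files, encrypted PDFs) visible instead of lumping them in with this bug.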