euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

list index out of range error #35

Open SharmileeS opened 10 years ago

SharmileeS commented 10 years ago

I get this error both from the pdf2txt command-line tool and from code:

    caching=caching, check_extractable=True):
  File "C:\Python27\lib\site-packages\pdfminer\pdfpage.py", line 123, in get_pages
    doc = PDFDocument(parser, caching=caching)
  File "C:\Python27\lib\site-packages\pdfminer\pdfdocument.py", line 309, in __init__
    xref.load(parser)
  File "C:\Python27\lib\site-packages\pdfminer\pdfdocument.py", line 194, in load
    objid1 = objs[index*2]
IndexError: list index out of range

euske commented 10 years ago

Do you get this error on every pdf? Can I have the pdf that causes this problem?

SharmileeS commented 10 years ago

How do I attach a PDF here?

euske commented 10 years ago

I don't think you can. Upload somewhere else and post a link to it.

alisufian commented 10 years ago

It doesn't happen on every PDF, just on some. Here's a link to one of the PDFs that shows the problem: http://webapp.psc.state.md.us/Intranet/Casenum/NewIndex3_VOpenFile.cfm?filepath=C:\Casenum\9200-9299\9208\Item_171\\Ex.D-smartmeterinstallationsfires.pdf
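
Because the error affects only some files, one way to narrow it down is to run the extraction over a batch and record which inputs raise. The following is a minimal, standard-library-only sketch: `triage` and its `extract` parameter are hypothetical names, and the caller is assumed to pass in whatever pdfminer-based extraction function they already use.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def triage(
    paths: Iterable[str],
    extract: Callable[[str], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Run `extract` on each path and separate successes from failures.

    `extract` is supplied by the caller (hypothetical here): any function
    that takes a file path and returns the extracted text.
    """
    texts: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            texts[path] = extract(path)
        except IndexError as exc:
            # The error reported in this issue; keep its message for the report.
            failures.append((path, str(exc)))
    return texts, failures
```

The list of failing paths can then be shared with the maintainer as reproduction cases.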

HoldenCaulfieldRye commented 10 years ago

I am getting this problem too. Has anyone figured out how to fix it?

night-crawler commented 10 years ago

The same issue.

t-kopp commented 10 years ago

Hi, did you have a chance to look into this? Do you need more pdfs to reproduce the issues or any other help with testing?

bauer1j commented 10 years ago

I am also experiencing this problem.

euske commented 10 years ago

Sorry for the late reply. Commit b589da51b7bd0ea97597fc8f40cf1e68115e5b94 fixed this, so the latest revision shouldn't have this problem.

t-kopp commented 10 years ago

Thanks for the hint - I'm running the latest git version now. The files no longer throw errors, but they produce one character per line for the whole file when I run pdf2txt -M 500 -L 13. Is there a workaround, or is it not possible to get proper output from those files? (Your commit comment said 'malformed PDFs'.)

euske commented 10 years ago

That's because the characters are part of an embedded object, and pdf2txt skips layout analysis on those. To force layout analysis on every object, try adding the -A option.
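
For anyone scripting this, the option combination discussed above can be assembled as follows. This is only a sketch: the function names are hypothetical, the script name `pdf2txt.py` is an assumption about how the tool is installed, and the tool is assumed to print to standard output when no output file is given. Only the -A, -M and -L flags come from this thread.

```python
import subprocess
from typing import List, Optional


def build_pdf2txt_command(
    pdf_path: str,
    all_texts: bool = True,
    char_margin: Optional[float] = None,
    line_margin: Optional[float] = None,
    script: str = "pdf2txt.py",  # assumption: name of the tool on PATH
) -> List[str]:
    """Build the argument list for pdf2txt with the flags from this thread."""
    cmd = [script]
    if all_texts:
        cmd.append("-A")  # run layout analysis on every object
    if char_margin is not None:
        cmd += ["-M", str(char_margin)]
    if line_margin is not None:
        cmd += ["-L", str(line_margin)]
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(pdf_path: str, **options) -> str:
    """Run the command and return what it printed (requires the tool installed)."""
    result = subprocess.run(
        build_pdf2txt_command(pdf_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Building the argument list separately from running it makes the exact command easy to inspect or log before anything is executed.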

youngspring1 commented 5 years ago

I get this error when I use pdfplumber. Environment: Python 3.6.5, pdfminer.six==20170720, pdfplumber==0.5.10

  File "/Users/xuyangchun/.pyenv/versions/evaluation365/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 661, in _getobj_parse
    objid1 = x[-2]
IndexError: list index out of range
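
The expression `x[-2]` in this traceback raises IndexError whenever `x` holds fewer than two items. The snippet below only illustrates that failure mode and a length guard against it. It is not pdfminer's code, and `last_pair` and `tokens` are hypothetical names.

```python
from typing import List, Optional, Tuple


def last_pair(tokens: List[int]) -> Optional[Tuple[int, int]]:
    """Return the last two items, or None if fewer than two are present."""
    if len(tokens) < 2:
        # Indexing tokens[-2] here would raise IndexError.
        return None
    return tokens[-2], tokens[-1]
```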