euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Failure to parse a PDF file from https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf #118

Open pombredanne opened 9 years ago

pombredanne commented 9 years ago

The file at https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf fails to parse. I verified this with both the latest PyPI release and the HEAD version. This small snippet reproduces the error:

wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}
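For readers decoding the `param` dictionary in the traceback above: the `V`, `R`, `Length` and `CFM` entries identify the encryption scheme the document uses. The sketch below is not part of pdfminer. `describe_encryption` is a hypothetical helper, and its lookup tables reflect my reading of the standard security handler in the PDF specification, so treat them as assumptions. It also assumes that name objects such as `/V2` have already been reduced to plain strings.

```python
# Hypothetical helper, not part of pdfminer: summarizes an /Encrypt
# dictionary whose name objects (e.g. /V2) were converted to plain strings.

# Assumed meaning of the /V entry (algorithm family).
V_MEANINGS = {
    1: "RC4 with a 40-bit key",
    2: "RC4 with a key longer than 40 bits",
    4: "crypt filters selected by /CF, /StmF and /StrF",
    5: "crypt filters with AES-256",
}

# Assumed meaning of a crypt filter's /CFM entry.
CFM_MEANINGS = {
    "None": "no encryption",
    "V2": "RC4",
    "AESV2": "AES-128",
    "AESV3": "AES-256",
}


def describe_encryption(param):
    """Return a one-line description of an /Encrypt dictionary."""
    v = param.get("V", 0)
    parts = ["V=%d [%s]" % (v, V_MEANINGS.get(v, "unknown"))]
    parts.append("R=%d" % param.get("R", 0))
    # /Length is the key length in bits; 40 is used when it is absent.
    parts.append("Length=%d bits" % param.get("Length", 40))
    # From V=4 on, the crypt filter named by /StmF selects the stream cipher.
    if v >= 4:
        name = param.get("StmF", "Identity")
        cfm = param.get("CF", {}).get(name, {}).get("CFM", "None")
        parts.append("stream cipher=%s" % CFM_MEANINGS.get(cfm, cfm))
    return ", ".join(parts)


# Values copied from the traceback (the /O and /U strings are omitted).
param = {
    "V": 4, "R": 4, "Length": 128,
    "CF": {"StdCF": {"Length": 16, "CFM": "V2", "AuthEvent": "DocOpen"}},
    "StmF": "StdCF", "StrF": "StdCF",
}
print(describe_encryption(param))
```

For this document the helper reports a version 4 handler with a 128-bit key and an RC4 crypt filter, which is the combination the installed pdfminer rejects as an unknown algorithm.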

pombredanne commented 9 years ago

Note that on Linux, running:

 wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
 pdfseparate -f 1 -l 1 5756M-PG101-R.pdf  5756M-PG101-R-p1.pdf

creates a small single-page PDF that has the same issue as the full document.

pombredanne commented 9 years ago

@euske Any hint on where I could start to help?

chid commented 8 years ago

@pombredanne This now works in the current version of pdfminer.

pombredanne commented 8 years ago

@chid It does not work for me on Ubuntu 14.04 LTS with Python 2.7.6. I tested with both HEAD and the PyPI release, and both still fail for me. Which environment are you using?

(tmp)pombreda@COMPUTER:~/tmp/pdfminer$ python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}

chid commented 8 years ago

I'm on Windows on Python 2.7.10. You could try removing the PDF protection with qpdf first.
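The qpdf workaround mentioned above could be scripted roughly as follows. This is a minimal sketch, assuming the `qpdf` executable is on the `PATH`; the function names and the output file name are made up for illustration. It relies on qpdf's `--decrypt` and `--password=` options, and an empty password is assumed to be sufficient for this file, which is not confirmed in this thread.

```python
import subprocess


def qpdf_decrypt_command(src, dst, password=""):
    """Build the argument list for removing encryption with qpdf."""
    cmd = ["qpdf", "--decrypt"]
    if password:
        # Only needed when the document has a non-empty password.
        cmd.append("--password=" + password)
    cmd.extend([src, dst])
    return cmd


def decrypt_with_qpdf(src, dst, password=""):
    """Run qpdf; raises CalledProcessError on any non-zero exit status
    and OSError if the qpdf executable cannot be found."""
    subprocess.check_call(qpdf_decrypt_command(src, dst, password))


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even on machines where qpdf is not installed.
    print(" ".join(qpdf_decrypt_command(
        "5756M-PG101-R.pdf", "5756M-PG101-R-decrypted.pdf")))
```

The decrypted copy can then be passed to `PDFParser` in place of the original file. Note that qpdf can also exit with a non-zero status when it only emits warnings, so a stricter or looser check than `check_call` may be appropriate.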

Edit: I just tried it on Raspbian and it works fine:

 sudo pip install --upgrade https://github.com/euske/pdfminer/zipball/master
 wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
 python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)"

pombredanne commented 8 years ago

@chid Thanks, but that's very weird. For https://github.com/nexB/scancode-toolkit I cannot afford to mandate a special version of Python 2.7 on Ubuntu (it comes built in), and I support Windows, Linux and Mac. qpdf could be an option, but it is a native tool rather than Python, which I prefer to avoid when possible (even though it is cross-platform). That said, if it works on your setups, the problem would seem to lie somewhere in the Python stdlib. Any idea where? If so, it could probably be patched fairly easily in pdfminer.
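Since the same pdfminer source behaves differently on two machines, it may help to compare the environments side by side before settling on the standard library as the culprit. The sketch below is one way to collect that information. `environment_report` and `module_available` are hypothetical names, and the default module list is an assumption: `Crypto` (pycrypto) is checked because, as far as I know, pdfminer can optionally use it for its ciphers, but nothing in this thread establishes that it is the cause.

```python
import platform
import sys


def module_available(name):
    """Return True if `name` can be imported in this interpreter."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False


def environment_report(optional_modules=("Crypto",)):
    """Collect facts worth comparing between a working and a failing machine."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    # Record, for each optional module, whether it is importable here.
    for name in optional_modules:
        report["has_" + name] = module_available(name)
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("%s: %s" % (key, value))
```

Running this on both the failing Ubuntu machine and a working one, then comparing the output line by line, would show whether the difference lies in the interpreter version or in the installed packages.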

chid commented 8 years ago

I might have a go at it on Ubuntu with the default Python and see if it works.