Open pombredanne opened 9 years ago
Note that on Linux using:
wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
pdfseparate -f 1 -l 1 5756M-PG101-R.pdf 5756M-PG101-R-p1.pdf
creates a small single-page PDF that has the same issue as the full document.
@euske any hint of where I could start to help?
@pombredanne This works now in the current version of pdfminer
@chid It does not work for me on Ubuntu 14.04 LTS with Python 2.7.6. Note that I tested with both HEAD and the PyPI release, and both still fail for me. Which environment do you use?
(tmp)pombreda@COMPUTER:~/tmp/pdfminer$ python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}
I'm on Windows on Python 2.7.10. You could try removing the PDF protection with qpdf first.
Edit: I just tried it on Raspbian and it works fine:
sudo pip install --upgrade https://github.com/euske/pdfminer/zipball/master
wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)"
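For anyone who wants to try the qpdf workaround from a script, here is a minimal sketch. It assumes the `qpdf` binary is installed and on the PATH; the helper names `qpdf_decrypt_command` and `decrypt_pdf` and the output filename are made up for illustration and are not part of pdfminer or qpdf.

```python
import subprocess


def qpdf_decrypt_command(src, dst, qpdf="qpdf"):
    """Build the argument list for `qpdf --decrypt SRC DST`."""
    # `qpdf` defaults to the binary name; pass a full path if it is not on PATH.
    return [qpdf, "--decrypt", src, dst]


def decrypt_pdf(src, dst):
    """Write an unencrypted copy of `src` to `dst` (hypothetical helper).

    Raises OSError if qpdf is not installed, and
    subprocess.CalledProcessError if qpdf exits with a non-zero status.
    """
    subprocess.check_call(qpdf_decrypt_command(src, dst))
    return dst


if __name__ == "__main__":
    # The output name is an arbitrary choice for this example.
    decrypt_pdf("5756M-PG101-R.pdf", "5756M-PG101-R-decrypted.pdf")
```

The decrypted copy could then be passed to `PDFParser` in place of the original file. This only sidesteps the error; it does not explain why the same pdfminer code behaves differently across machines.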
@chid Thanks, but that's very weird. For https://github.com/nexB/scancode-toolkit I cannot afford to mandate a special version of Python 2.7 on Ubuntu (it comes built in), and I support Windows, Linux and Mac. qpdf could be an option, but it is a native tool rather than Python, which I like to avoid when possible (even though it is cross-platform).
That said, this would mean that the problem lies somewhere in the Python stdlib. Any idea where? If so, it could probably be patched fairly easily in pdfminer.
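One way to narrow this down is to compare what actually differs between a working and a failing machine. The sketch below prints a few facts to diff. The module list is an assumption: checking for `Crypto` (PyCrypto) is only a hypothesis, namely that support for `V: 4` encryption may depend on an optional crypto package being importable. Nothing in this thread confirms that. The helper names are made up for illustration.

```python
from __future__ import print_function

import importlib
import platform


def module_available(name):
    """Return True if `name` can be imported in this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def environment_report(modules=("Crypto", "pdfminer")):
    """Collect facts that might differ between a working and a failing setup.

    The default module names are guesses (assumptions), not a confirmed
    list of what pdfminer needs.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in modules:
        report[name] = module_available(name)
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("%s: %s" % (key, value))
```

Running this on both machines and comparing the output would show whether the difference is an installed package rather than the stdlib itself.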
I might have a go at it on Ubuntu with the default Python and see if it works.
The file at https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf fails to parse. I verified this with the latest PyPI version and with the HEAD version. This small snippet reproduces the error: