euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Exception in PDF to text extraction #59

Open the-happy-hippo opened 10 years ago

the-happy-hippo commented 10 years ago

When trying to parse PDF at http://www.ada.gov/hospcombrprt.pdf, I get the following error:

pdfdocument.py", line 348, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={'CF': {'StdCF': {'Length': 16, 'CFM': /AESV2, 'AuthEvent': /DocOpen}}, 'O': '~?\x05\xaa\x169\xf9\x1f\xb0\x15\xce\x10\x81\x07\xd5\xb3\xf3&\xceB\xe3\xa6\x85\xa4l\xfd1\\\xb2\xf4l\xb9', 'Filter': /Standard, 'P': -1324, 'Length': 128, 'R': 4, 'U': '\x8c\x11\xa0\xa9\xd7\xb1C\x8c<\x92\x9fN\x94}\x98\x91\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'V': 4, 'StmF': /StdCF, 'StrF': /StdCF}

euske commented 10 years ago

You need to install pycrypto to handle a certain type of encryption.

the-happy-hippo commented 10 years ago

I've installed pycrypto and still get the same exception (unknown algorithm). Please note that I never got an ImportError for pycrypto, so I don't see how it is related to the original exception.

euske commented 10 years ago

You will not see an ImportError even if pycrypto is not installed. PDFMiner silently falls back to running without it when it is unavailable (hence the error). You probably need to tweak pdfdocument.py to check whether it is being imported correctly. Again, this feature was contributed only recently, so there might still be other caveats.
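
One way to check the import without editing pdfdocument.py is to try it directly in the same interpreter that runs PDFMiner. This is a minimal sketch, assuming the AES support depends on pycrypto's Crypto.Cipher.AES and Crypto.Hash.SHA256 modules (an assumption, not confirmed in this thread). check_optional_modules is a hypothetical helper, not part of PDFMiner:

```python
import importlib


def check_optional_modules(names):
    """Try to import each module and return {name: error message or None}."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = None  # import succeeded
        except ImportError as exc:
            status[name] = str(exc)  # record why the import failed
    return status


if __name__ == "__main__":
    # Assumed module names; adjust to whatever pdfdocument.py imports.
    wanted = ["Crypto.Cipher.AES", "Crypto.Hash.SHA256"]
    for name, err in sorted(check_optional_modules(wanted).items()):
        print("%s: %s" % (name, "OK" if err is None else "MISSING (%s)" % err))
```

If any module is reported as missing, pycrypto is probably installed for a different Python interpreter than the one running PDFMiner.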

skumar34 commented 7 years ago

Hello euske, I am working with pdfminer and get the same error for some PDF files, even after installing pycrypto. Could you please let me know what should be tweaked in pdfdocument.py?

thedapperlabel commented 7 years ago

@joncodo Could you please elaborate on how it worked for you?