atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

NotImplementedError: only algorithm code 1 and 2 are supported #325

Closed RealDataLLC closed 4 years ago

RealDataLLC commented 5 years ago

I'm having trouble running this code on my Mac. I'm using a Conda virtual env and installed camelot using conda. The PDF is not password protected.

import camelot
import pandas as pd
import re
import numpy as np

table1 = camelot.read_pdf('IEEJ - 2019 - Outlook.pdf')


NotImplementedError                       Traceback (most recent call last)
in
----> 1 table1 = camelot.read_pdf('IEEJ - 2019 - Outlook.pdf')#, pages = ex_page, password = None)#, area = (left, 112, right,112+ 90))
      2 table1

/anaconda3/envs/tensorflow/lib/python3.6/site-packages/camelot/io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
    104     kwargs = remove_extra(kwargs, flavor=flavor)
    105     tables = p.parse(flavor=flavor, suppress_stdout=suppress_stdout,
--> 106                      layout_kwargs=layout_kwargs, **kwargs)
    107     return tables

/anaconda3/envs/tensorflow/lib/python3.6/site-packages/camelot/handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
    153         with TemporaryDirectory() as tempdir:
    154             for p in self.pages:
--> 155                 self._save_page(self.filepath, p, tempdir)
    156             pages = [os.path.join(tempdir, 'page-{0}.pdf'.format(p))
    157                      for p in self.pages]

/anaconda3/envs/tensorflow/lib/python3.6/site-packages/camelot/handlers.py in _save_page(self, filepath, page, temp)
     98             infile = PdfFileReader(fileobj, strict=False)
     99             if infile.isEncrypted:
--> 100                 infile.decrypt(self.password)
    101             fpath = os.path.join(temp, 'page-{0}.pdf'.format(page))
    102             froot, fext = os.path.splitext(fpath)

/anaconda3/envs/tensorflow/lib/python3.6/site-packages/PyPDF2/pdf.py in decrypt(self, password)
   1985         self._override_encryption = True
   1986         try:
-> 1987             return self._decrypt(password)
   1988         finally:
   1989             self._override_encryption = False

/anaconda3/envs/tensorflow/lib/python3.6/site-packages/PyPDF2/pdf.py in _decrypt(self, password)
   1994             raise NotImplementedError("only Standard PDF encryption handler is available")
   1995         if not (encrypt['/V'] in (1, 2)):
-> 1996             raise NotImplementedError("only algorithm code 1 and 2 are supported")
   1997         user_password, key = self._authenticateUserPassword(password)
   1998         if user_password:

NotImplementedError: only algorithm code 1 and 2 are supported
pachacamac commented 4 years ago

Hate to be that guy but any update on this? Totally at a loss here. If it helps I'm running on Linux.

> pdftk "that.pdf" dump_data
WARNING: The creator of the input PDF:
   that.pdf
   has set an owner password (which is not required to handle this PDF).
   You did not supply this password. Please respect any copyright.
InfoBegin
InfoKey: Creator
InfoValue: IDM
InfoBegin
InfoKey: CreationDate
InfoValue: D:20180607145130+02'00'
InfoBegin
InfoKey: Producer
InfoValue: PDFlib+PDI 7.0.2 (COM/Win32)
InfoBegin
InfoKey: Author
InfoValue: IntegraDM
PdfID0: 939f2420294646f31f041d74020f2c30
PdfID1: 939f2420294646f31f041d74020f2c30
NumberOfPages: 10
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 2
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 3
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 4
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 5
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 6
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 7
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 8
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 9
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92
PageMediaBegin
PageMediaNumber: 10
PageMediaRotation: 0
PageMediaRect: 0 0 595.2 841.92
PageMediaDimensions: 595.2 841.92

and

> file that.pdf
that.pdf: PDF document, version 1.7

Unfortunately I cannot share the original PDF as it contains sensitive data, but reading it otherwise works fine, and https://github.com/jcushman/pdfquery handles it without problems.

myleshk commented 4 years ago

Same here. Please fix.

boranaf commented 4 years ago

Hi, here is a file that gives the same error: "MGROS-2017Y.pdf only algorithm code 1 and 2 are supported".

MGROS-2017Y.pdf

myleshk commented 4 years ago

I realized that this is an issue in the dependency PyPDF2, dating from 2015.
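The traceback in the original report only reaches `infile.decrypt(...)` because `infile.isEncrypted` is true, and PyPDF2 then rejects any encryption dictionary whose `/V` entry is not 1 or 2. A rough way to check what a given file declares, using only the standard library, is sketched below. The function name `encryption_algorithm_code` is made up for this illustration, and the byte-level regex scan is a heuristic assumption (it expects the trailer to point at the encryption dictionary through an indirect reference such as `/Encrypt 5 0 R`), not how PyPDF2 or camelot parse the file.

```python
import re


def encryption_algorithm_code(pdf_bytes):
    """Heuristically return the /V value of a PDF's encryption dictionary.

    Returns None when no indirect /Encrypt reference is found or when the
    referenced object has no readable /V entry. Hypothetical helper for
    illustration only; a real PDF parser is more reliable.
    """
    # Look for a trailer entry such as "/Encrypt 5 0 R".
    ref = re.search(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R", pdf_bytes)
    if ref is None:
        return None
    obj_num, gen = ref.group(1), ref.group(2)

    # Find the body of the referenced object: "5 0 obj ... endobj".
    obj = re.search(
        rb"(?<!\d)" + obj_num + rb"\s+" + gen + rb"\s+obj(.*?)endobj",
        pdf_bytes,
        re.DOTALL,
    )
    if obj is None:
        return None

    # /V is the algorithm code that PyPDF2 compares against (1, 2).
    version = re.search(rb"/V\s+(\d+)", obj.group(1))
    return int(version.group(1)) if version else None


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        print(encryption_algorithm_code(fh.read()))
```

If this prints a value other than 1 or 2, the file would be expected to trigger the same `NotImplementedError`, even when no password is needed to open it (for example when only an owner password is set, as in the pdftk output earlier in the thread).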

boranaf commented 4 years ago

Thanks for your feedback, which prompted me to retry. You are right, @myleshk.

vinayak-mehta commented 4 years ago

Is the PDF encrypted? Can you try decrypting it using qpdf and then try again?
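For an automated pipeline, the qpdf suggestion could be wrapped roughly as follows. This is a minimal sketch assuming the `qpdf` binary is installed and on `PATH`; `--decrypt` and `--password=` are qpdf command-line options, while the helper names `qpdf_decrypt_command` and `decrypt_with_qpdf` are invented for this example.

```python
import subprocess


def qpdf_decrypt_command(src, dst, password=""):
    """Build the argument list for `qpdf --decrypt src dst`."""
    cmd = ["qpdf", "--decrypt"]
    if password:
        # Only needed when the file has a non-empty user password.
        cmd.append("--password=" + password)
    cmd += [src, dst]
    return cmd


def decrypt_with_qpdf(src, dst, password=""):
    """Write an unencrypted copy of src to dst using the qpdf CLI."""
    result = subprocess.run(qpdf_decrypt_command(src, dst, password))
    # Assumption: qpdf exits with 0 on success and 3 when it succeeded
    # but emitted warnings; anything else is treated as a failure here.
    if result.returncode not in (0, 3):
        raise RuntimeError("qpdf failed with exit code %d" % result.returncode)
```

After `decrypt_with_qpdf("that.pdf", "that-decrypted.pdf")`, the decrypted copy can be passed to `camelot.read_pdf` in place of the original.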

pachacamac commented 4 years ago

@vinayak-mehta mine is not encrypted. And as I said, pdfquery, another Python library, can read it just fine.

vinayak-mehta commented 4 years ago

I understand. Looked at pdfquery, it looks nice! Interestingly, it also uses pdfminer under the hood. I'll look into this over the weekend.

vinayak-mehta commented 4 years ago

Sorry for the late responses to issues.

alexxxkorolev commented 4 years ago

Camelot does not support Acrobat files version 6 or higher. Convert your PDF file to a lower version (I used Acrobat 4.0, PDF 1.3) with any online converter, and the problem should be solved!

pachacamac commented 4 years ago

@alexxxkorolev thanks for the tip! Any suggestion for a command-line tool, preferably on Linux, that can downgrade PDFs? The problem is that I use camelot in an automated pipeline and cannot convert PDFs manually.

manohar9600 commented 3 years ago

https://github.com/mstamy2/PyPDF2/issues/378#issuecomment-689585779 Using pikepdf solved it for me.
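A minimal sketch of how the pikepdf workaround might be wired into a pipeline: try the normal read first, and only when the `NotImplementedError` from this issue is raised, save an unencrypted copy and retry. It assumes `pikepdf` is installed; `pikepdf.open` and `Pdf.save` are real pikepdf calls, while `decrypt_with_pikepdf` and `read_tables` are names made up for this example and are not part of camelot.

```python
import os
import tempfile


def decrypt_with_pikepdf(src, dst):
    """Rewrite src to dst without encryption (requires `pip install pikepdf`)."""
    import pikepdf  # imported lazily so the module loads without pikepdf

    # pikepdf.open() tries an empty password by default; Pdf.save() drops
    # the existing encryption unless encryption settings are passed in.
    with pikepdf.open(src) as pdf:
        pdf.save(dst)


def read_tables(path, read_pdf, decrypt=decrypt_with_pikepdf, **kwargs):
    """Call read_pdf(path); on NotImplementedError, decrypt a copy and retry.

    `read_pdf` is passed in (for example camelot.read_pdf) so this sketch
    does not depend on a particular camelot version.
    """
    try:
        return read_pdf(path, **kwargs)
    except NotImplementedError:
        # Assumption: read_pdf finishes parsing before it returns, so the
        # temporary decrypted copy can be deleted afterwards.
        with tempfile.TemporaryDirectory() as tmp:
            clean = os.path.join(tmp, "decrypted.pdf")
            decrypt(path, clean)
            return read_pdf(clean, **kwargs)
```

Usage would look like `tables = read_tables("that.pdf", camelot.read_pdf)`. Passing the reader and the decrypt step as arguments keeps the retry logic independent of either library, so `decrypt` could be swapped for a qpdf-based function.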