extractText() in PyPDF4 not working while working in PyPDF2

michd89 commented 5 years ago

I have the following file: zen_of_python_corrupted.pdf. According to the PDF's internal code, the text content is somehow corrupted, compressed, or differently encoded. However, it opens fine in a PDF viewer.

Now I want to extract the text in Python. With PyPDF2 it looks like this:

import PyPDF2
reader = PyPDF2.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.getNumPages()):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)

And indeed it prints the Zen of Python.

With PyPDF4 it is:

import PyPDF4
reader = PyPDF4.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.numPages):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)

But there I only get: Error -3 while decompressing data: incorrect data check

Since I can't find this particular error message anywhere in the PyPDF4 code, I assume the error is raised by a third-party library. Still, I find it odd that the same file works with the older PyPDF2. Do you have any idea what causes this? Does it work on your systems if you try it out?
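
For readers hitting the same message: its wording matches what Python's standard zlib module raises when the Adler-32 checksum at the end of a compressed stream does not match the decompressed data. The sketch below reproduces the message outside any PDF library and shows one possible way to read such a stream anyway. It assumes the deflate data itself is intact and only the checksum is wrong, which may or may not be true for the file in this issue. The helper names are hypothetical and not part of PyPDF2 or PyPDF4.

```python
import zlib


def corrupt_checksum(data: bytes) -> bytes:
    """Return zlib data with the last byte of its Adler-32 trailer flipped."""
    damaged = bytearray(data)
    damaged[-1] ^= 0xFF
    return bytes(damaged)


def decompress_ignoring_checksum(data: bytes) -> bytes:
    """Inflate zlib data as a raw deflate stream, skipping the 2-byte header.

    With wbits=-15 zlib expects no header and does not verify the trailing
    Adler-32, so the trailer is simply left over as unused data.
    """
    inflater = zlib.decompressobj(-15)
    return inflater.decompress(data[2:])


payload = b"Beautiful is better than ugly.\n" * 4
damaged = corrupt_checksum(zlib.compress(payload))

try:
    zlib.decompress(damaged)
    message = None
except zlib.error as exc:
    message = str(exc)

print(message)  # Error -3 while decompressing data: incorrect data check

recovered = decompress_ignoring_checksum(damaged)
print(recovered == payload)  # True
```

This only demonstrates where the message comes from. It does not by itself explain why PyPDF2 reads the file successfully.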

bigcats01 commented 4 years ago

import PyPDF4

pdfobject = open('test.pdf', 'rb')

pdfReader = PyPDF4.PdfFileReader(pdfobject, strict=False)
pdfWriter = PyPDF4.PdfFileWriter()
#print(bdc)
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)

    TEST = pageObj.extractText()
    print(TEST)
    if 'complex' in TEST:
        # ...... (rest of this branch was elided in the original post)
        print('complex')
        pdfWriter.addPage(pageObj)

areneededtoobtainaﬁnalresulttheyarenormallyincludedinfull;thisshould enabletheinstructortodeterminewhetherastudent’sincorrectanswerisdueto amisunderstandingofprinciplesortoatechnicalerror. Inallnewpublications,onpaperoronawebsite,errorsandtypographical mistakesarevirtuallyunavoidableandwewouldbegratefultoanyinstructor whobringsinstancestoourattention. KenRiley,kfr1000@cam.ac.uk, MichaelHobson,mph@mrao.cam.ac.uk, Cambridge,2006 xx


PdfStreamError                            Traceback (most recent call last)

<ipython-input-5-d2518890f7f6> in <module>
      9     pageObj = pdfReader.getPage(pageNum)
     10 
---> 11     TEST = pageObj.extractText()
     12     print(TEST)
     13     if str('complex') in TEST:

~\Anaconda\lib\site-packages\PyPDF4\pdf.py in extractText(self)
   2659         content = self["/Contents"].getObject()
   2660         if not isinstance(content, ContentStream):
-> 2661             content = ContentStream(content, self.pdf)
   2662         # Note: we check all strings are TextStringObjects.  ByteStringObjects
   2663         # are strings where the byte->string encoding was unknown, so adding

~\Anaconda\lib\site-packages\PyPDF4\pdf.py in __init__(self, stream, pdf)
   2739         else:
   2740             stream = BytesIO(b_(stream.getData()))
-> 2741         self.__parseContentStream(stream)
   2742 
   2743     def __parseContentStream(self, stream):

~\Anaconda\lib\site-packages\PyPDF4\pdf.py in __parseContentStream(self, stream)
   2771                     peek = stream.read(1)
   2772             else:
-> 2773                 operands.append(readObject(stream, None))
   2774 
   2775     def _readInlineImage(self, stream):

~\Anaconda\lib\site-packages\PyPDF4\generic.py in readObject(stream, pdf)
     75     elif idx == 5:
     76         # string object
---> 77         return readStringFromStream(stream)
     78     elif idx == 6:
     79         # null object

~\Anaconda\lib\site-packages\PyPDF4\generic.py in readStringFromStream(stream)
    332         if not tok:
    333             # stream has truncated prematurely
--> 334             raise PdfStreamError("Stream has ended unexpectedly")
    335         if tok == b_("("):
    336             parens += 1

PdfStreamError: Stream has ended unexpectedly

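extractText() works on some pages, but it seems to struggle with particular things inside the text: the output above has lost all its spaces, and a later page fails with PdfStreamError.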
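
One way to keep a single bad page from aborting the whole loop is to catch the error per page and record it. This is only a sketch of that idea. `extract_text_per_page` is a hypothetical helper, not a PyPDF4 function, and the `FakePage` class exists only so the example runs without a PDF file.

```python
def extract_text_per_page(pages, extract):
    """Apply extract(page) to every page, collecting failures instead of raising.

    Returns (texts, failures): texts has one entry per page (None where
    extraction failed), failures is a list of (page_index, error_description).
    """
    texts, failures = [], []
    for index, page in enumerate(pages):
        try:
            texts.append(extract(page))
        except Exception as exc:  # e.g. PdfStreamError or zlib.error
            texts.append(None)
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return texts, failures


class FakePage:
    """Stand-in for a page object so the sketch is runnable on its own."""

    def __init__(self, text=None):
        self._text = text

    def extractText(self):
        if self._text is None:
            raise ValueError("Stream has ended unexpectedly")
        return self._text


pages = [FakePage("page one"), FakePage(None), FakePage("page three")]
texts, failures = extract_text_per_page(pages, lambda p: p.extractText())
print(texts)
print(failures)
```

With PyPDF4 the same helper could be called as `extract_text_per_page([pdfReader.getPage(i) for i in range(pdfReader.numPages)], lambda p: p.extractText())`, using only the calls already shown in this thread.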

Schaekermann commented 2 months ago

I have Python 3.11 and pypdf installed. Output of pip freeze:
pypdf==4.1.0

In case others struggle with the same task, here's what worked for me with a correct PDF, following the pypdf documentation:

from pypdf import PdfReader

reader = PdfReader('zen_of_python_corrupted.pdf')
# Iterate over reader.pages directly instead of calling the private _get_num_pages()
for page in reader.pages:
    text = page.extract_text()
    print(text)

claird / PyPDF4

extractText() in PyPDF4 not working while working in PyPDF2 #31