claird / PyPDF4

A utility to read and write PDFs with Python
obsolete-https://pythonhosted.org/PyPDF2/

PdfFileReader crashes on a normal PDF file #63

Open askerlee opened 5 years ago

askerlee commented 5 years ago

The PDF file causing the error is attached: 1.pdf. This one-page file was extracted from a PDF using Acrobat.

When I open it with PdfFileReader and access numPages, the script crashes with an exception:

pypdf.utils.PdfReadError: Cannot fetch a free object (id, next gen.) = (8, 0)

But if I decompress the file first using qpdf, there is no error. It seems the file contains some rare low-level structure.
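
For anyone who wants to try the qpdf workaround from a script, here is a minimal sketch. The exact qpdf options used above are not stated, so the flags below are an assumption: `--stream-data=uncompress` decompresses stream data, and `--object-streams=disable` is a guess at what might make the file readable. The helper names `build_qpdf_command` and `decompress_pdf` are made up for this example, and qpdf must be installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_qpdf_command(src, dst):
    """Build the qpdf command line (the flag choice is an assumption, see above)."""
    return [
        "qpdf",
        "--stream-data=uncompress",   # write stream data uncompressed
        "--object-streams=disable",   # write objects outside object streams
        str(src),
        str(dst),
    ]


def decompress_pdf(src, dst):
    """Run qpdf to write a decompressed copy of src to dst."""
    result = subprocess.run(build_qpdf_command(src, dst))
    # qpdf exits with 0 on success and 3 when it only issued warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError("qpdf failed with exit code %d" % result.returncode)
    return Path(dst)


# Example (needs qpdf and the attached file):
# decompress_pdf("1.pdf", "1-decompressed.pdf")
```

The decompressed copy can then be passed to PdfFileReader in place of the original file.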

askerlee commented 5 years ago

I ran into another file with the same error 😢 So it seems it's not that rare. test.pdf

askerlee commented 5 years ago

Update: these files open normally with PyPDF2, so this seems to be a bug introduced in PyPDF4.

cadu-leite commented 4 years ago

Is this bug still valid? I ask because I was able to merge those files with another one.

jucajuca commented 3 years ago

Any updates on this? I am also having trouble reading the following file:

101880043.pdf

cadu-leite commented 3 years ago

In [1]: from PyPDF4 import PdfFileReader

In [2]: test_pdf = PdfFileReader(open('test.pdf', 'rb'))

In [3]: One_pdf = PdfFileReader(open('1.pdf', 'rb'))

In [4]: one_pdf = PdfFileReader(open('1.pdf', 'rb'))

In [5]: numbers = PdfFileReader(open('101880043.pdf', 'rb'))
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
<ipython-input-5-dfaa3ab77f43> in <module>
----> 1 numbers = PdfFileReader(open('101880043.pdf', 'rb'))

~/.virtualenvs/pypdf4_issue_63/lib/python3.9/site-packages/PyPDF4/pdf.py in __init__(self, stream, strict, warndest, overwriteWarnings)
   1146             stream = BytesIO(b_(fileobj.read()))
   1147             fileobj.close()
-> 1148         self.read(stream)
   1149         self.stream = stream
   1150

~/.virtualenvs/pypdf4_issue_63/lib/python3.9/site-packages/PyPDF4/pdf.py in read(self, stream)
   1964                     continue
   1965                 # no xref table found at specified location
-> 1966                 raise utils.PdfReadError("Could not find xref table at specified location")
   1967         #if not zero-indexed, verify that the table is correct; change it if necessary
   1968         if self.xrefIndex and not self.strict:

PdfReadError: Could not find xref table at specified location

In [6]: test_pdf.getNumPages()
Out[6]: 1

In [7]: one_pdf.getNumPages()
Out[7]: 1

I got an error using @jucajuca's PDF sample, but the others seem to work.
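
To make the comparison across the three attachments easier to repeat, the session above could be wrapped in a small helper. This is only a sketch: `page_count_or_error` is a hypothetical name, and the reader class is passed in as a parameter so the same check can be run with PyPDF4 or PyPDF2. It assumes the class accepts a binary file object and offers `getNumPages()`, as in the session above.

```python
def page_count_or_error(path, reader_cls):
    """Return (page_count, None) on success or (None, error_text) on failure.

    reader_cls is assumed to behave like PdfFileReader: it takes a binary
    stream and exposes getNumPages().
    """
    try:
        # Keep the file open while the reader works on the stream.
        with open(path, "rb") as stream:
            return reader_cls(stream).getNumPages(), None
    except Exception as exc:
        # Report the failure instead of stopping at the first bad file.
        return None, "%s: %s" % (type(exc).__name__, exc)


# Example (needs PyPDF4 installed and the attached files):
# from PyPDF4 import PdfFileReader
# for name in ("test.pdf", "1.pdf", "101880043.pdf"):
#     print(name, page_count_or_error(name, PdfFileReader))
```

Running it once per library would show whether a given file fails only under PyPDF4.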

Using ...

You may post some code samples ...