claird / PyPDF4

A utility to read and write PDFs with Python
obsolete-https://pythonhosted.org/PyPDF2/

strange issue with resolvedObjects #11

Open DeliciousHair opened 6 years ago

DeliciousHair commented 6 years ago

Unfortunately, I cannot share the source documents that are causing this problem, so what I'm instead looking for is some hints as to where I may look to find what could be causing this (I took a look at the source for PdfFileReader and nothing is jumping out at me) so that I could create a workaround at a minimum.

Using IPython, this is what I get:

In [1]: import PyPDF4

In [2]: pdf = PyPDF4.PdfFileReader('some_file.pdf')

In [3]: for key, val in pdf.resolvedObjects.items():
   ...:     print(key, val)
   ...:     
(0, 634) {'/DecodeParms': {'/Columns': 3, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Index': [614, 22], '/Info': IndirectObject(613, 0), '/Prev': 1820112, '/Root': IndirectObject(615, 0), '/Size': 636, '/Type': '/XRef', '/W': [1, 2, 0]}
(0, 611) {'/DecodeParms': {'/Columns': 4, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Info': IndirectObject(613, 0), '/Root': IndirectObject(615, 0), '/Size': 614, '/Type': '/XRef', '/W': [1, 3, 0]}

In [4]: pdf.resolvedObjects
Out[4]: ---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381                 if cls in self.type_pprinters:
    382                     # printer registered in self.type_pprinters
--> 383                     return self.type_pprinters[cls](obj, self, cycle)
    384                 else:
    385                     # deferred printer

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    610                 and not (p.max_seq_length and len(obj) >= p.max_seq_length):
    611             keys = _sorted_for_pprint(keys)
--> 612         for idx, key in p._enumerate(keys):
    613             if idx:
    614                 p.text(',')

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/lib/pretty.py in _enumerate(self, seq)
    284     def _enumerate(self, seq):
    285         """like enumerate, but with an upper limit on the number of items"""
--> 286         for idx, x in enumerate(seq):
    287             if self.max_seq_length and idx >= self.max_seq_length:
    288                 self.text(',')

RuntimeError: dictionary changed size during iteration

In [5]: for key, val in pdf.resolvedObjects.items():
   ...:     print(key, val)
   ...:     
(0, 634) {'/DecodeParms': {'/Columns': 3, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Index': [614, 22], '/Info': IndirectObject(613, 0), '/Prev': 1820112, '/Root': IndirectObject(615, 0), '/Size': 636, '/Type': '/XRef', '/W': [1, 2, 0]}
(0, 611) {'/DecodeParms': {'/Columns': 4, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Info': IndirectObject(613, 0), '/Root': IndirectObject(615, 0), '/Size': 614, '/Type': '/XRef', '/W': [1, 3, 0]}
(0, 610) {'/Filter': '/FlateDecode', '/First': 6, '/N': 1, '/Type': '/ObjStm'}
(0, 613) {'/CreationDate': "D:20150730143930+10'00'", '/Creator': '28C-1', '/ModDate': "D:20150803093650+10'00'", '/Producer': 'Develop ineo+ 280'}
(0, 615) {'/Metadata': IndirectObject(608, 0), '/OpenAction': [IndirectObject(616, 0), '/Fit'], '/Pages': IndirectObject(612, 0), '/Type': '/Catalog'}
(0, 608) {'/Subtype': '/XML', '/Type': '/Metadata'}
(0, 609) {'/Filter': '/FlateDecode', '/First': 6, '/N': 1, '/Type': '/ObjStm'}
(0, 612) {'/Count': 26, '/Kids': [IndirectObject(616, 0), IndirectObject(1, 0), IndirectObject(23, 0), IndirectObject(43, 0), IndirectObject(53, 0), IndirectObject(68, 0), IndirectObject(109, 0), IndirectObject(127, 0), IndirectObject(163, 0), IndirectObject(217, 0), IndirectObject(275, 0), IndirectObject(305, 0), IndirectObject(334, 0), IndirectObject(389, 0), IndirectObject(414, 0), IndirectObject(426, 0), IndirectObject(435, 0), IndirectObject(460, 0), IndirectObject(468, 0), IndirectObject(478, 0), IndirectObject(489, 0), IndirectObject(508, 0), IndirectObject(518, 0), IndirectObject(540, 0), IndirectObject(554, 0), IndirectObject(575, 0)], '/Type': '/Pages'}

so pdf.resolvedObjects is clearly changing somehow, in that the __init__ method seems to be giving something incomplete. I can make a workable-ish workaround via:

In [15]: pdf = PyPDF4.PdfFileReader('some_file.pdf')

In [16]: pdf._flatten()

In [17]: for key, val in pdf.resolvedObjects.items():
    ...:     print(key, val)
    ...:     
(0, 634) {'/DecodeParms': {'/Columns': 3, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Index': [614, 22], '/Info': IndirectObject(613, 0), '/Prev': 1820112, '/Root': IndirectObject(615, 0), '/Size': 636, '/Type': '/XRef', '/W': [1, 2, 0]}
(0, 611) {'/DecodeParms': {'/Columns': 4, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Info': IndirectObject(613, 0), '/Root': IndirectObject(615, 0), '/Size': 614, '/Type': '/XRef', '/W': [1, 3, 0]}
(0, 615) {'/Metadata': IndirectObject(608, 0), '/OpenAction': [IndirectObject(616, 0), '/Fit'], '/Pages': IndirectObject(612, 0), '/Type': '/Catalog'}
(0, 609) {'/Filter': '/FlateDecode', '/First': 6, '/N': 1, '/Type': '/ObjStm'}
(0, 612) {'/Count': 26, '/Kids': [IndirectObject(616, 0), IndirectObject(1, 0), IndirectObject(23, 0), IndirectObject(43, 0), IndirectObject(53, 0), IndirectObject(68, 0), IndirectObject(109, 0), IndirectObject(127, 0), IndirectObject(163, 0), IndirectObject(217, 0), IndirectObject(275, 0), IndirectObject(305, 0), IndirectObject(334, 0), IndirectObject(389, 0), IndirectObject(414, 0), IndirectObject(426, 0), IndirectObject(435, 0), IndirectObject(460, 0), IndirectObject(468, 0), IndirectObject(478, 0), IndirectObject(489, 0), IndirectObject(508, 0), IndirectObject(518, 0), IndirectObject(540, 0), IndirectObject(554, 0), IndirectObject(575, 0)], '/Type': '/Pages'}
.
.
.

so I can at least do the tasks I need to do with the document, but making a call to pdf.resolvedObjects still raises an exception the first time I try to use it.

Any idea what may be causing this? Would be more than happy to help with a fix if I can get some help tracking down the source of the problem.
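For readers unfamiliar with the error message above, here is a toy reproduction of "dictionary changed size during iteration". It is only a sketch, assuming that displaying a value can trigger a lookup that stores a new entry in the same dictionary; `LazyValue`, `make_cache`, `iterate_live` and `iterate_snapshot` are hypothetical names and are not part of PyPDF4.

```python
class LazyValue:
    """Hypothetical stand-in for a value whose repr resolves a reference."""

    def __init__(self, cache, target):
        self.cache = cache
        self.target = target

    def __repr__(self):
        # Resolving the reference stores a new entry in the shared cache.
        self.cache.setdefault(self.target, "resolved %d" % self.target)
        return "LazyValue(%d)" % self.target


def make_cache():
    """Build a one-entry cache whose value points at a not-yet-cached key."""
    cache = {}
    cache[1] = LazyValue(cache, 2)
    return cache


def iterate_live(cache):
    """Iterate the dict directly; return the error message, or None."""
    try:
        for key, val in cache.items():
            repr(val)  # may grow `cache` while it is being iterated
    except RuntimeError as exc:
        return str(exc)
    return None


def iterate_snapshot(cache):
    """Iterate over a copy of the items, so growth of `cache` is harmless."""
    return [(key, repr(val)) for key, val in list(cache.items())]
```

In this toy, the first call to `iterate_live` fails and a second call on the same cache succeeds, because the cache has already been filled by then. Iterating over `list(cache.items())` avoids the error from the start.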

acsor commented 6 years ago

It appears you were using Python 3.6 in that example, am I right? What do you get with Python 2 instead?

acsor commented 6 years ago

I am still fairly new to the codebase and haven't delved deeply into this specific issue yet, but none of the PDF samples in the PDF_Samples/ directory seems to have its list of objects read. I wouldn't rule out a wider problem than the one reported here.

from os import listdir
from PyPDF4 import PdfFileReader

DIR = "PDF_Samples/"

for f in listdir(DIR):
    if f.endswith(".pdf"):
        r = PdfFileReader(DIR + f)
        print("len(r.resolvedObjects) = %d" % len(r.resolvedObjects))
$ python3 ./resolved_objects.py 
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1799]
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0

The very same output is produced by Python 2.

The method responsible for populating PdfFileReader.resolvedObjects is cacheIndirectObject(). Its call chain would ideally be __init__() > read() > cacheIndirectObject(), but some internal machinery prevents cacheIndirectObject() from ever being reached (I'm sure of this because I placed a quick'n'dirty print() statement at the beginning of the method).

DeliciousHair commented 6 years ago

Ah yes, I think you've found the culprit--seems to work as it should in 2.7.

Alright then, that means that the problem is a 2 vs. 3 thing buried somewhere in the PdfFileReader.read method (called on line 1148 in __init__), which uses cacheIndirectObject (among other things) to populate self.resolvedObjects.

Time to put on my detective hat I guess, that is a particularly gruesome block of code, but at least I have a hint of what's going on now.

EDIT: Sorry, I see you already said that. It's very early and my brain is still waking up it would seem.

DeliciousHair commented 6 years ago

OK, I set debug=True in PdfFileReader.read and added a really informative message after line 1866 (if x.isdigit():) just to see if the code is even making it to the caching call, and this is what I now get:

In [1]: import PyPDF4

In [2]: pdf = PyPDF4.PdfFileReader('some_file.pdf')
>>read <_io.BytesIO object at 0x11019c938>
  line: b''
  line: b'%%EOF'
****** I am here ******
read idx_pairs=[(614, 22)]
XREF Uncompressed: 614 0
XREF Uncompressed: 615 0
.
.
.

****** I am here ******
read idx_pairs=[(0, 614)]
XREF Uncompressed: 1 0
.
.
.
XREF Compressed: 612 609 0
XREF Compressed: 613 610 0

So the reader appears to be sort-of doing its thing, in that it is at least finding the items to put into the document list, and the two appearances of my very informative print statement correspond to the two objects that appear in the .resolvedObjects attribute. It's just that nothing else is making it in. But this is alright; the search for the culprit is narrowing rapidly.

EDIT:

Alright, this is going to be much trickier than I thought, as the code is really, really convoluted. However, what seems to be happening is that __init__ calls self.read(stream), which does a lot of hard-to-follow things, but the main hangup seems to be that somewhere in the process there are calls to self.getObject(), only they are made via things like:

...
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
...

but this doesn't make sense at this point, since f is not an attribute of self at the outset, so how can it have a method .getObject() to call? So it seems like __init__ is then leaving .resolvedObjects as a set of pointers to unresolved things, so that when I try to display it within IPython, which forces everything to be evaluated, we end up with the error that I got in the first place.

I may, of course, be going the totally wrong way with this, but it seems a plausible culprit at this point; any thoughts?
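On the question of how f can have a .getObject() method: the traceback later in this thread shows a getObject() in generic.py whose body is `return self.pdf.getObject(self).getObject()`, which suggests that each reference object keeps a pointer to its reader and delegates to it. The sketch below illustrates that delegation pattern with toy classes. `ToyReader` and `ToyIndirectObject` are hypothetical stand-ins, not the library's implementation, and the (generation, idnum) cache key is an assumption based on the keys printed in the first comment.

```python
class ToyReader:
    """Hypothetical reader that resolves references and caches the results."""

    def __init__(self, objects):
        self._objects = objects  # parsed objects, keyed by (generation, idnum)
        self.cache = {}          # filled lazily, one entry per resolved reference

    def getObject(self, ref):
        key = (ref.generation, ref.idnum)
        if key not in self.cache:
            # First access: "parse" the object and remember it.
            self.cache[key] = self._objects[key]
        return self.cache[key]


class ToyIndirectObject:
    """Hypothetical reference that knows which reader it belongs to."""

    def __init__(self, idnum, generation, pdf):
        self.idnum = idnum
        self.generation = generation
        self.pdf = pdf

    def getObject(self):
        # Delegate to the owning reader; this is why f.getObject() works
        # even though f is not an attribute of the reader.
        return self.pdf.getObject(self)


reader = ToyReader({(0, 5): {"/T": "name"}})
fields = [ToyIndirectObject(5, 0, reader)]

cache_before = dict(reader.cache)
for f in fields:
    field = f.getObject()
cache_after = dict(reader.cache)
```

In this toy the cache is empty right after construction and only grows when a reference is resolved, which would be consistent with a cache dictionary that changes between two inspections.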

acsor commented 6 years ago

First off, what revision have you checked out while testing this code, @DeliciousHair? I'm going to pinpoint the problem across the latest commits and see if it has been introduced there. (Your HEAD tip I assume to be claird:master anyway.)

DeliciousHair commented 6 years ago

I'm assuming that you're asking about this?

$ git describe --tags
v1.27.0-9-g2ca3e19

FWIW, I'm in the process of tracing out how PdfFileReader.read() actually is functioning, as I think the problem is down a bunch of convoluted back-and-forth that should likely be streamlined anyway.

DeliciousHair commented 6 years ago

Getting closer. Maybe.

In .read() there is the line:

                newTrailer = readObject(stream, self)

actually there are a few calls to readObject(stream, self), but following this through piece by piece we see around line 1664 (I've made some changes, hence the approximation) there is:

            self.stream.seek(start, 0)

which is clearly a problem when this is being called from within __init__, since self.stream is only defined after self.read(stream) has been run. Simply switching the order of these lines in __init__ doesn't fix things, unfortunately.

acsor commented 6 years ago

I do not maintain this repository, but I'll leave #11 to you rather than have us both work on the same thing at once. If you (think you) have solved the problem, I invite you to submit a PR; if not, you can pass the issue to me and share what you have found while working on it.

Besides this issue, I suspect that resolvedObjects should on average contain more object references than it currently does. Much of PyPDF4 is untested and, if it turns out not to meet the specifications, it would benefit from being refactored/recoded (where relevant) and from having some unit tests added.

DeliciousHair commented 6 years ago

No problem, I'll keep plugging away at this as I could really stand to have it working but I should add that I am not really much of a coder, so my repairs could end up being as much of a mess as the current thing.

To try and meet said specifications, where can they be found? In particular, I have completely abandoned 2.7 some time ago now and a lot of cleaning-up could be accomplished immediately by dropping a bunch of the wrappers for binary data present that are essentially pointless for 3.x.

acsor commented 6 years ago

To try and meet said specifications, where can they be found?

Eh eh, good question. I've been contributing to this project lately assuming that there were some, whereas in fact I have only added an incomplete bare minimum of my own to README.md, and much else is still lacking.

I think that the project owner should be concerned with setting up some simple but effective contribution guidelines for allowing casual contributors such as me and you to stay in line with the rules. (Hey @claird, I'm open for that position, i.e. drafting simple but useful contribution rules!)

In particular, I have completely abandoned 2.7

Good thing. Quoting the requests library, you chose Python 3+ and you're a person of taste. FYI, it seems PyPDF4 will be supporting both 2.7 and version 3, the latter of which is definitely recommended.

I am not really much of a coder

If that's so, I'll act as an intermediary between this issue and a possible future pull request that will solve it. You point me to the alleged mistake in the code and I'll take care of fixing it through a PR. Mail me if you want to keep in contact ;-).

acsor commented 6 years ago

Hi again @DeliciousHair. I cannot yet be 100% sure about this, but it seems that resolvedObjects is intended for internal use only (and should, indeed, be renamed to _resolvedObjects, like many other supposedly public methods).

So it looks like you were using the library incorrectly. What I would suggest, instead, is to rely on the PdfFileReader.getObject() method, which I have documented in one of my latest revisions, and whose use is demonstrated in part below:

from os.path import join
from PyPDF4 import PdfFileReader
from PyPDF4.generic import IndirectObject

DIR = "PDF_Samples/"
file = "AutoCad_Diagram.pdf"

r = PdfFileReader(join(DIR, file), debug=False)

o = filter(
    lambda e: e is not None,
    [r.getObject(IndirectObject(idnum, 0, r)) for idnum in range(1, 19)]
)
o = list(o)

print("len(o) == %d\n" % len(o))
$ python3 ./resolved_objects.py 
len(o) == 18

Now -- I know, I know. How do you know which object references are there? This is something I am trying to work out myself. getObject() requires an IndirectObject instance with the generation and identifier numbers, but you have to know these. From what I've seen so far, no public method or property stores the list of valid (gen. num, identifier) entries; resolvedObjects seems to act as a cache dictionary, which is why you were surprised to see it changing unfathomably.

That said, if I've been correct in my analysis, a feature to extract the list of indirect objects in the File Body indexed by the Cross-Reference Table (this is all ISO 32000/PDF jargon) could definitely be added :+1:.
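Until such a feature exists, one stopgap is to probe a range of identifier numbers and keep whatever resolves. The helper below is a minimal sketch of that idea and is not part of PyPDF4: `collect_objects` is a hypothetical name, and it takes the lookup function and the reference factory as parameters so it does not depend on the library's exact behaviour for missing objects.

```python
def collect_objects(get_object, make_ref, idnums):
    """Probe each identifier number and return {idnum: object} for the hits.

    get_object: callable that resolves a reference (hypothetical parameter).
    make_ref:   callable that builds a reference from an identifier number.
    idnums:     iterable of identifier numbers to try.
    """
    found = {}
    for idnum in idnums:
        try:
            obj = get_object(make_ref(idnum))
        except Exception:
            # Broad on purpose for this sketch; real code should catch the
            # library's own read error instead.
            continue
        if obj is not None:
            found[idnum] = obj
    return found


# Stand-in lookup so the sketch runs without a PDF file.
_fake_store = {1: "catalog", 3: "pages"}


def _fake_get_object(ref):
    if ref == 2:
        raise KeyError(ref)    # simulates a lookup that fails loudly
    return _fake_store.get(ref)  # returns None for unknown identifiers


result = collect_objects(_fake_get_object, lambda n: n, range(1, 5))
```

With the script above, the equivalent call would presumably be `collect_objects(r.getObject, lambda n: IndirectObject(n, 0, r), range(1, 19))`, under the assumption that every object of interest has generation number 0.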

DeliciousHair commented 6 years ago

@newnone:

Wow, that's a lot of stuff done! Apologies for just leaving you hanging, but I seem to have missed the notifications for the previous two responses. I've had to put this effort aside myself due to work commitments, but will definitely be checking out your modifications very shortly.

Great work! :-)

(in the meantime I've discovered something even stranger, but I raised that in #21 instead.)

DeliciousHair commented 6 years ago

I have tried this PR on a number of files and overall it works much better when it works, but it also falls over critically in a number of instances where master is able to trundle along, albeit in a convoluted manner. I cannot share the sample documents I'm using, but leave this with me and I'll share the logs at least. Just not today due to time constraints unfortunately.

acsor commented 6 years ago

Splendid, I'll work toward reducing those failure cases and improving PyPDF even further.

acsor commented 6 years ago

It seemed to me that you were using resolvedObjects (now renamed to _cachedObjects) to access the indirect objects of your PDF files. Whether that was its use or not, can we consider this issue closed now that #14 has been merged?

I remind you that if you wish to access all the indirect objects from a PDF file, you should resort to PdfFileReader.objects().

claird commented 6 years ago

My only contribution to this thread is to applaud the progress you've made.

I understand that PyPDF4 might appear to have regressed for a few specific documents. I'm sure that's part of a larger move to a richer testing suite.

Cameron Laird, vice president We make computers work for people.


DeliciousHair commented 6 years ago

I think I've found a bit of a pattern to the hard-fails. I've got a large volume of documents that are TIFF scans placed into a PDF container, and a large minority of them have been further modified using, presumably, something like MS Paint and then exported to PDF via Ghostscript. A convoluted process for sure, but I have no control over the source material.

Regardless, with the previous version (i.e., pre-#14 merge) the behaviour with these documents was problematic, but the new merge makes them completely inaccessible:

In [1]: import pypdf

In [2]: pdf = pypdf.PdfFileReader('failing_sample.pdf')
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
<ipython-input-2-8ac78e782e5f> in <module>()
----> 1 pdf = pypdf.PdfFileReader('failing_sample.pdf')

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in __init__(self, stream, strict, warndest, overwriteWarnings, debug)
   1311 
   1312         self.stream = stream
-> 1313         self._parsePdfFile(stream)
   1314 
   1315     def __repr__(self):

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in _parsePdfFile(self, stream)
   2369                         elif self.strict:
   2370                             raise PdfReadError(
-> 2371                                 "Unknown xref type: %s" % xrefType
   2372                             )
   2373 

PdfReadError: Unknown xref type: 255

In [3]: pdf = pypdf.PdfFileReader('failing_sample.pdf', strict=False)

In [4]: pdf.getPage(0)
PdfReadWarning: Object 1 0 not defined. [pdf.py:2076]
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
<ipython-input-4-b34ec9cc413a> in <module>()
----> 1 pdf.getPage(0)

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in getPage(self, pageNumber)
   1461         # Ensure that we're not trying to access an encrypted PDF
   1462         if self._flattenedPages is None:
-> 1463             self._flatten()
   1464 
   1465         return self._flattenedPages[pageNumber]

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in _flatten(self, pages, inherit, indirectRef)
   1814         if pages is None:
   1815             self._flattenedPages = []
-> 1816             catalog = self._trailer["/Root"].getObject()
   1817             pages = catalog["/Pages"].getObject()
   1818 

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/generic.py in __getitem__(self, key)
    570 
    571     def __getitem__(self, key):
--> 572         return dict.__getitem__(self, key).getObject()
    573 
    574     def getXmpMetadata(self):

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/generic.py in getObject(self)
    198 
    199     def getObject(self):
--> 200         return self.pdf.getObject(self).getObject()
    201 
    202     def __repr__(self):

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in getObject(self, ref)
   2077             )
   2078             raise PdfReadError(
-> 2079                 "Could not find object (%d, %d)" % (ref.idnum, ref.generation)
   2080             )
   2081 

PdfReadError: Could not find object (1, 0)

vs. with the pre-#14 version:

In [1]: import PyPDF4 as pypdf

In [2]: pdf = pypdf.PdfFileReader('failing_sample.pdf')

In [3]: pdf.getPage(0).keys()
Out[3]: dict_keys(['/Resources', '/MediaBox', '/Type', '/Parent', '/Contents', '/Rotate'])

If you like, I may be able to share some prototype examples with you directly; I suspect that this is now a fairly small fix to the monumental amount of work you've already done. Please get in touch with me directly if you're interested.

In the meantime, I will start going through what you have done for #14 and see if I can figure out the failure point as well as I really want to migrate to the newer version--when it works, it works sooo good! Excellent job!

acsor commented 6 years ago

If you like, I may be able to share some prototype examples with you directly

Yes, please do send me the sample files. If they need to stay private, send them to nildexo@yandex.com.

claird commented 6 years ago

"... I have no control over the source material ...": I assume that all of us with any degree of expertise in PDF recognize that these workflows we program are largely mistakes that can't be better rationalized because of some external constraint. PDF work always seems to be that way.

My summary: we understand that you're working with imperfect materials. I personally very much appreciate your efforts, DeliciousHair, to improve PyPDF4's still so-primitive testing.

acsor commented 6 years ago

Initially I wondered whether some of the filters in filters.py might be causing the problem, but that was probably a far-fetched hypothesis. It took just a few seconds of inspecting the stack trace to notice an "xref" type (very poor nomenclature, to be changed) equal to 255. Neither PyPDF nor any other PDF software can do anything about it AFAIK, judging from the 2008 ISO 32000 standard.

My suggestion: set strict=False in PdfFileReader __init__() and see what happens. The nature of this "problem" stands out very clearly to me.

acsor commented 6 years ago

Paragraph 7.5.8.3 has a relevant excerpt from the standard. We do not interpret unrecognized Cross-Reference Stream types as references to the null value, but report them.

[Image: the cited paragraph of the standard]
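To make the strict versus lenient distinction concrete, here is a minimal sketch of how cross-reference stream entries could be split using the /W widths (the trailers printed earlier in this thread use [1, 2, 0] and [1, 3, 0]). It is not PyPDF4's parser: `decode_xref_entries` is a hypothetical name, the input is assumed to be already decompressed and un-predicted, and defaulting every zero-width field other than the first to 0 is a simplification.

```python
def decode_xref_entries(data, widths, strict=True):
    """Split decoded xref-stream bytes into (type, field2, field3) tuples."""
    entry_size = sum(widths)
    if entry_size == 0 or len(data) % entry_size:
        raise ValueError("data length does not match the /W widths")

    entries = []
    for start in range(0, len(data), entry_size):
        pos = start
        fields = []
        for i, width in enumerate(widths):
            if width == 0:
                # Absent field: the type defaults to 1; other fields are
                # set to 0 here (a simplification).
                fields.append(1 if i == 0 else 0)
            else:
                # Fields are stored as big-endian unsigned integers.
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width

        if fields[0] not in (0, 1, 2):
            if strict:
                raise ValueError("Unknown xref type: %d" % fields[0])
            # Lenient mode: treat the entry as a reference to the null object.
            fields = [None, None, None]
        entries.append(tuple(fields))
    return entries


# Three entries with widths [1, 2, 0]: uncompressed, compressed, unknown type.
sample = bytes([1, 0x00, 0x10, 2, 0x02, 0x61, 255, 0x00, 0x00])
lenient = decode_xref_entries(sample, [1, 2, 0], strict=False)
```

In strict mode the same sample raises on the third entry, which mirrors the "Unknown xref type: 255" error quoted above. Whether a lenient reader can still open a given file depends on which objects those unknown entries were supposed to describe.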

DeliciousHair commented 6 years ago

Yup, that is correct. Notice, however, that I did try using strict=False in the __init__ method which led to the error of being unable to flatten the PDF document.

Side rant, I find it very frustrating that Adobe has made their product so robust that tools that create totally non-compliant documents still manage to render as expected; makes tasks like this needlessly difficult! :-)

EDIT: also note that I am able to brute-force access to the document via the pre-#14 version of PyPDF

eykamp commented 4 years ago

Note that the changes suggested here:

https://stackoverflow.com/questions/45978113/pypdf2-write-doesnt-work-on-some-pdf-files-python-3-5-1/52687771#52687771

fixed the problem for me. The line numbers have changed, but the changes still "fit".

gonultasbu commented 4 years ago

The suggested changes were proposed by me; let me know if a PR is needed, although it might break some other, unknown functionality.