Failure with some files: pypdf2: multiple definitions in dictionary

jwhendy commented 8 years ago

I'm getting this error with some html pages I printed as pdfs and want to concatenate:

$ stapler cat blog_bup*.pdf output.pdf
Traceback (most recent call last):
  File "/usr/bin/stapler", line 15, in <module>
    stapler.main()
  File "/opt/stapler/staplelib/stapler.py", line 79, in main
    modes[mode](args)
  File "/opt/stapler/staplelib/commands.py", line 58, in select
    iohelper.write_pdf(output, outputfilename)
  File "/opt/stapler/staplelib/iohelper.py", line 52, in write_pdf
    pdf.write(outputStream)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 482, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 577, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1611, in getObject
    retval = readObject(self.stream, self)
  File "/usr/lib/python2.7/site-packages/PyPDF2/generic.py", line 66, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "/usr/lib/python2.7/site-packages/PyPDF2/generic.py", line 585, in readFromStream
    % (utils.hexStr(stream.tell()), key))
PyPDF2.utils.PdfReadError: Multiple definitions in dictionary at byte 0x3246eb for key /Type

I can successfully cat other test pdfs together, so I'm not sure what's going on. Unfortunately, it's a backup of an old blog I'd rather not share, but I'd be happy to hunt around if you know what to look for. I took a look at the binary pdf file in nano and see a bunch of /Types: /XObject, /Catalog, /Pages, /Annot, etc.

Is there an issue with the file having many of these?

Perhaps this is unique to pypdf2 but I wanted to mention it so you're aware. If you can confirm, I'll report there instead. I get this both when pulling stapler manually and when installing the Arch AUR package.

fwenzel commented 8 years ago

When you look at the traceback you see at the bottom that it explodes in the PyPDF2 package, so it's not inherently a stapler bug. So I suggest you file a bug there. While I understand that you don't want to share the file, do note that this makes it a lot harder to analyze. Maybe you could figure out what page this is happening on and extract that page and only share that with the authors?

jwhendy commented 8 years ago

Thanks @fwenzel . While it said it was pypdf2, I wasn't positive if this was inherent with that package or perhaps some call/implementation you're using. I'm not really a python guy, and wasn't sure if every error indicated that the issue was definitely with the library, or perhaps some update changed a call, for example, and how you were using it might have required a change.

I think it's clear now, so I'll close this. I'll see if I can find some reproducible way to create the bug. For now, I was able to just use pdfunite as discussed here, so my motivation has decreased slightly :)

jwhendy commented 8 years ago

Re-opening after searching around again... I ran into this issue on pypdf2 last night, but it didn't dawn on me to try applying the fix to your files.

I grepped for PdfFileReader in your files and got a hit in staplelib/iohelper.py:

pdf = PdfFileReader(file(filename, "rb"))

Applying the change referenced in the pypdf2 bug, I made it:

pdf = PdfFileReader(file(filename, "rb"), strict=False)

That spits out a big list of errors about multiple dictionary definitions for /Type, but still creates what looks like an identical file. I spot checked 5 or so pages in the 228 page output and they all match.
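For anyone wondering what strict changes, here is a toy sketch of the behaviour described above: a repeated key is fatal in strict mode and only produces a warning otherwise. This is not PyPDF2's actual code. `build_dictionary`, the local `PdfReadError` class, and the sample entries are made up for illustration, and keeping the first value for a repeated key is an assumption of the sketch.

```python
import warnings


class PdfReadError(Exception):
    """Local stand-in for PyPDF2.utils.PdfReadError (illustration only)."""


def build_dictionary(pairs, strict=True):
    """Build a dict from (key, value) pairs read from a PDF dictionary.

    A repeated key raises in strict mode and only warns otherwise.
    Keeping the first value is an assumption of this sketch.
    """
    result = {}
    for key, value in pairs:
        if key not in result:
            result[key] = value
        elif strict:
            raise PdfReadError(
                "Multiple definitions in dictionary for key %s" % key)
        else:
            warnings.warn(
                "Multiple definitions in dictionary for key %s" % key)
    return result


# Made-up entries that declare /Type twice, to mimic the reported error.
entries = [("/Type", "/Annot"), ("/Subtype", "/Link"), ("/Type", "/Annot")]
```

Calling `build_dictionary(entries)` raises, while `build_dictionary(entries, strict=False)` warns once and still returns a usable dictionary, which matches what I saw: lots of messages, but a complete output file.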

Is this still a pypdf2 bug, or do you think implementing the strict=False bit would be of interest in your code? It looks like this bug led him to create that error message, so I think it might be inherent in how pypdf2 works. Thus, I see two options:

- your code stays as-is, and those with multiple /Type values can't cat pdfs with stapler
- if the strict=False option doesn't have unintended side-effects, stapler can still work for these conditions, operating within this error bypass provided by pypdf2

fwenzel commented 8 years ago

Aha! Thanks for pointing that out. I don't see why we'd do strict=True. Stapler should handle as many files as possible, even if there are some issues buried in them. For instance, within reason, stapler should probably be able to merge two malformed pages into one two-page document, with the same errors still present inside the pages.
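A minimal sketch of how a lenient default could look in a read helper, assuming the reader is called as (stream, strict=...) the way PdfFileReader is called earlier in this thread. `open_pdf` and its `reader_factory` parameter are hypothetical names, not existing stapler code.

```python
def open_pdf(filename, reader_factory, strict=False):
    """Open `filename` in binary mode and pass it to `reader_factory`.

    `reader_factory` is expected to accept (stream, strict=...), as
    PyPDF2's PdfFileReader does. `open_pdf` itself is a hypothetical
    helper, not part of stapler.
    """
    stream = open(filename, "rb")
    try:
        # The stream is deliberately left open on success: the traceback
        # above shows objects still being read from it at write time.
        return reader_factory(stream, strict=strict)
    except Exception:
        # Do not leak the file handle if the reader rejects the file.
        stream.close()
        raise
```

With PyPDF2 installed, this would be called as `open_pdf(filename, PdfFileReader)`. Callers who want the old behaviour can still pass `strict=True`.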

jwhendy commented 8 years ago

No problem, and thanks for taking a look! I left a comment on pypdf2 to try and understand if this is a bug, per se, just a limitation, something they want to improve, etc. Haven't gotten a comment back there, so not really sure how to look at this...

fwenzel / stapler

Failure with some files: pypdf2: multiple definitions in dictionary #7