Open jwhendy opened 8 years ago
When you look at the traceback you see at the bottom that it explodes in the PyPDF2 package, so it's not inherently a stapler bug. So I suggest you file a bug there. While I understand that you don't want to share the file, do note that this makes it a lot harder to analyze. Maybe you could figure out what page this is happening on and extract that page and only share that with the authors?
Thanks @fwenzel. While it said it was pypdf2, I wasn't positive if this was inherent with that package or perhaps some call/implementation you're using. I'm not really a python guy, and wasn't sure if every error indicated that the issue was definitely with the library, or perhaps some update changed a call, for example, and how you were using it might have required a change.
I think it's clear now, so I'll close this. I'll see if I can find some reproducible way to create the bug. For now, I was able to just use pdfunite as discussed here, so my motivation has decreased slightly :)
Re-opening after searching around again... I ran into this issue on pypdf2 last night, but it didn't dawn on me to try applying the fix to your files.
I grepped for PdfFileReader in your files and got a hit in staplelib/iohelper.py:
pdf = PdfFileReader(file(filename, "rb"))
Applying the change referenced in the pypdf2 bug, I made it:
pdf = PdfFileReader(file(filename, "rb"), strict=False)
That spits out a big list of errors about multiple dictionary definitions for /Type, but still creates what looks like an identical file. I spot checked 5 or so pages in the 228 page output and they all match.
Is this still a pypdf2 bug, or do you think implementing the strict=False bit would be of interest in your code? It looks like this bug led him to create that error message, so I think it might be inherent in how pypdf2 works. Thus, I see two options:
- files with multiple /Type values can't cat pdfs with stapler
- if the strict=False option doesn't have unintended side-effects, stapler can still work for these conditions, operating within this error bypass provided by pypdf2
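For reference, the strict=False change discussed above could be sketched roughly as follows. This is only an illustration, not stapler's actual code: open_pdf_lenient and its reader_cls parameter are hypothetical names, and it assumes a PyPDF2 release that still provides PdfFileReader with a strict keyword, as used in this thread. It also uses open() in place of the Python 2 file() builtin.

```python
def open_pdf_lenient(filename, reader_cls=None):
    """Open a PDF with strict parsing disabled (hypothetical helper)."""
    if reader_cls is None:
        # Imported lazily so the helper can be exercised with a stand-in class.
        # Assumption: the installed PyPDF2 still ships PdfFileReader.
        from PyPDF2 import PdfFileReader
        reader_cls = PdfFileReader

    # The reader pulls data from the stream on demand, so the file is
    # deliberately left open and must outlive the returned reader.
    stream = open(filename, "rb")
    try:
        # strict=False asks the reader to warn about malformed input
        # where it can, instead of raising.
        return reader_cls(stream, strict=False)
    except Exception:
        # Don't leak the handle if the reader cannot even be constructed.
        stream.close()
        raise
```

Whether the warnings printed in non-strict mode can always be ignored safely is exactly the open question in this thread, so spot checking the output as described above still seems prudent.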
Aha! Thanks for pointing that out. I don't see why we'd do strict=True. Stapler should handle as many files as possible, even if there are some issues buried in them. For instance, within reason, stapler should probably be able to merge two malformed pages into one two-page document that still carries the same errors inside its pages.
No problem, and thanks for taking a look! I left a comment on pypdf2 to try and understand if this is a bug, per se, just a limitation, something they want to improve, etc. Haven't gotten a comment back there, so not really sure how to look at this...
I'm getting this error with some html pages I printed as pdfs and want to concatenate:
I can successfully cat other test pdfs together, so I'm not sure what's going on. Unfortunately, it's a backup of an old blog I'd care not to share but I'd be happy to hunt around if you might know what to look for. I took a look at the binary pdf file in nano and see a bunch of /Types: /XObject, /Catalog, /Pages, /Annot, etc. Is there an issue with the file having many of these?
Perhaps this is unique to pypdf2 but I wanted to mention it so you're aware. If you can confirm, I'll report there instead. I get this both when manually pulling stapler and when installing the Arch AUR package.
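As a rough alternative to eyeballing the file in nano, the /Type names mentioned above can be tallied with a few lines of Python. This is a minimal sketch under stated assumptions: count_type_names and count_type_names_in_file are hypothetical helper names, and the scan only sees dictionaries stored as plain text in the file, so anything inside compressed streams is missed. Many /Type entries across a file is normal, since most object dictionaries carry one; the count is only a starting point for hunting.

```python
import re
from collections import Counter

# Matches "/Type /Name" or "/Type/Name". "/Subtype" and "/Type1" are not
# matched, because the pattern needs a literal "/Type" followed by a name.
TYPE_RE = re.compile(rb"/Type\s*/([A-Za-z0-9]+)")


def count_type_names(data):
    """Tally the names that follow /Type in raw PDF bytes."""
    return Counter(m.group(1).decode("ascii") for m in TYPE_RE.finditer(data))


def count_type_names_in_file(path):
    """Read a PDF as bytes and tally its uncompressed /Type names."""
    with open(path, "rb") as fh:
        return count_type_names(fh.read())
```

Calling count_type_names_in_file on the problem file and on one that concatenates cleanly, then comparing the two tallies, might hint at which object kinds are unusual without having to share the file itself.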