Closed DanielJackson-Oslo closed 5 years ago
https://www.pdfen.com/pdf-a-validator gives no errors for the file.
https://www.datalogics.com/products/pdftools/pdf-checker/ gives the following output for it, suggesting that the only error is some missing fonts?
PDF Checker 1.4.1 Copyright 2018-2019 Datalogics, Inc. All Rights Reserved
Wed Jun 5 04:31:19 2019
JSON Profile: everything.json
Input Document: DinMicrosoft-fakturaoversikt.pdf
<<=CHECKER_SUMMARY_START=>>
fonts:uses-base14fonts-not-embedded
<<=CHECKER_SUMMARY_END=>>
General Results
Errors:
None
Information:
None
Checks Completed:
claims-pdfa-conformance
contains-owner-password
contains-signature
damaged
password-protected
pdf-v2
unable-to-open
xfa-type
Userdata Results
Errors:
None
Information:
None
Checks Completed:
contains-annots
contains-annots-not-for-printing
contains-annots-not-for-viewing
contains-annots-without-normal-appearances
contains-embedded-files
contains-metadata
contains-optional-content
contains-private-data
contains-transparency
Fonts Results
Errors:
Uses Base 14 fonts not embedded in document:
Helvetica (1 instance)
Helvetica-Bold (1 instance)
Information:
None
Checks Completed:
fontdescriptor-missing-capheight
fontdescriptor-missing-fields
uses-base14fonts-not-embedded
uses-fonts-fully-embedded
uses-fonts-not-embedded
Objects Results
Errors:
None
Information:
None
Checks Completed:
contains-javascript-actions
contains-thumbnails
Cleanup Results
Errors:
None
Information:
None
Checks Completed:
suboptimal-compression
Image Results
Errors:
None
Information:
None
Checks Completed:
alternate-images
Color Images
Errors:
None
Information:
None
Checks Completed:
image-depth
resolution-too-high
resolution-too-low
uses-jpeg2000-compression
Grayscale Images
Errors:
None
Information:
None
Checks Completed:
resolution-too-high
resolution-too-low
uses-jpeg2000-compression
Monochrome Images
Errors:
None
Information:
None
Checks Completed:
resolution-too-high
resolution-too-low
uses-jbig2-compression
Hello @DanielJackson-Oslo!
I ran the Din_Microsoft-fakturaoversikt.pdf file you shared through qpdf (a very useful PDF validation tool). It turns out the file is technically invalid:
$ qpdf --check ~/Din_Microsoft-fakturaoversikt.pdf
checking ~/Din_Microsoft-fakturaoversikt.pdf
PDF Version: 1.3
File is not encrypted
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): stream keyword not followed by proper line terminator
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 1971): expected endstream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): attempting to recover stream length
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): recovered stream length: 1854
File is not linearized
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): stream keyword not followed by proper line terminator
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 15835): expected endstream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): attempting to recover stream length
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): recovered stream length: 1514
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (offset 121): error decoding stream data for object 6 0: stream inflate: inflate: data: incorrect header check
page 1: content stream (content stream object 6 0): errors while decoding content stream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (offset 14325): error decoding stream data for object 9 0: stream inflate: inflate: data: incorrect header check
page 2: content stream (content stream object 9 0): errors while decoding content stream
Two of the stream objects contained in this file are corrupt. This is why pdf-lib throws an error when trying to parse it.
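For context on the "incorrect header check" warnings above: zlib reports this when the first two bytes of the data do not form a valid zlib header as defined in RFC 1950. The sketch below shows that check in isolation. The function name `looksLikeZlibHeader` is made up for illustration; it is not part of qpdf or pdf-lib, and it says nothing about why the bytes in this particular file fail the check.

```typescript
// Hypothetical helper (not a qpdf or pdf-lib API): returns true when the
// first two bytes look like a valid zlib header per RFC 1950.
function looksLikeZlibHeader(data: Uint8Array): boolean {
  if (data.length < 2) return false;
  const cmf = data[0]; // compression method and flags
  const flg = data[1]; // flags, including the check bits
  const method = cmf & 0x0f; // 8 means "deflate"
  const windowInfo = cmf >> 4; // CINFO, must be 7 or less
  // The two bytes, read as a big-endian 16-bit number, must be a multiple of 31.
  const checkOk = ((cmf << 8) | flg) % 31 === 0;
  return method === 8 && windowInfo <= 7 && checkOk;
}
```

If a reader starts decoding even one byte too early or too late (for example because of an unexpected line terminator after the `stream` keyword), a check like this would typically fail.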
That being said, I think it would be possible to adapt pdf-lib's parser to tolerate these specific stream errors. I'll look into this and get back to you.
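To illustrate what tolerating these stream errors could look like, here is a minimal sketch of the kind of recovery qpdf describes in its warnings: accept a nonstandard line terminator after `stream`, then scan forward for `endstream` to work out the length. This is an assumption-laden illustration, not the fix that went into pdf-lib; `indexOfKeyword` and `recoverStream` are hypothetical names.

```typescript
// Hypothetical helper: find an ASCII keyword in `data` at or after `from`.
function indexOfKeyword(data: Uint8Array, keyword: string, from: number): number {
  outer: for (let i = from; i + keyword.length <= data.length; i++) {
    for (let j = 0; j < keyword.length; j++) {
      if (data[i + j] !== keyword.charCodeAt(j)) continue outer;
    }
    return i;
  }
  return -1;
}

// Hypothetical helper: given the offset of a "stream" keyword, guess where
// the stream data starts and how long it is by scanning for "endstream".
function recoverStream(
  data: Uint8Array,
  streamKeywordOffset: number
): { start: number; length: number } | null {
  let start = streamKeywordOffset + "stream".length;
  // The PDF spec asks for CRLF or LF here; this sketch also accepts a bare CR.
  if (data[start] === 0x0d) start++;
  if (data[start] === 0x0a) start++;
  const end = indexOfKeyword(data, "endstream", start);
  if (end < 0) return null;
  // Drop one end-of-line marker that precedes "endstream", if present.
  let stop = end;
  if (stop > start && data[stop - 1] === 0x0a) stop--;
  if (stop > start && data[stop - 1] === 0x0d) stop--;
  return { start, length: stop - start };
}
```

This approach is a heuristic: compressed data can legitimately contain the bytes of `endstream` or end in a newline byte, so a real parser would prefer the declared `/Length` whenever it turns out to be consistent.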
I just cut version 0.6.4-rc1 of pdf-lib. It contains a fix for this issue.
You can install this prerelease with npm:
npm install pdf-lib@0.6.4-rc1
It's also available on unpkg:
Please try it out and let me know if it works for you!
@DanielJackson-Oslo I'd like to add the Din_Microsoft-fakturaoversikt.pdf file you shared to the pdf-lib GitHub repo to create a regression test for this issue. Do you mind? Does the file contain any sensitive information?
It looks like it might be a test billing statement? But I can't tell for sure since it's not written in English.
@Hopding Feel free to use it! It's a bill for my own Office 365, presumably the same one they generate for all customers.
Thanks for the quick follow-up. Looking forward to the 0.6.4 release. How stable is the RC?
@Hopding Since this isn't the first time this sort of problem has come up, I'd imagine there are hundreds, if not thousands, of different ways that PDFs can be malformed but still render in most PDF readers, and thus exist in the wild.
I don't know much about the technical nature of PDFs, but for my use case, I'd really only need pdf-lib to recognize where the PDF pages start, and then copy those into a new PDF without further validating them. (All I want to do is merge two PDFs, I don't need any control or understanding of the contents).
I see that there's a "copy" function in the library. Is that what it does? If not, could I somehow help write a "merge blindly" function?
@DanielJackson-Oslo The RC should be perfectly stable. The only change it includes is the fix for this issue. And of course, it passed all the unit and integration tests before I cut it. So if it's working well for you, then there shouldn't be anything to worry about. (I always cut RCs for every release, no matter how trivial the changes).
It would certainly be possible to get away with less object parsing (and therefore tolerate more invalid objects) if you just want to copy pages. However, in order to find and copy the page objects (and any other objects they reference) it is still necessary to parse some objects.
Implementing this sort of "lazy parsing" would take more than just writing a function, though. It would be necessary to modify some of pdf-lib's parsing code. The parser currently scans input PDFs from start to finish, parsing each object it encounters along the way.
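As a rough illustration of the difference, a lazy approach could start from the cross-reference table at the end of the file, which lists the byte offset of every object, and then parse only the objects it actually needs. The sketch below reads a classic xref table from a PDF held as a string. It is not pdf-lib code: `readXrefOffsets` is a hypothetical name, and the sketch assumes a single, well-formed table (no xref streams, no incremental updates), which a malformed file may not provide.

```typescript
// Hypothetical helper (not part of pdf-lib): maps object number to byte
// offset using the classic cross-reference table. Assumes `pdf` holds the
// file decoded one byte per character (e.g. latin1).
function readXrefOffsets(pdf: string): Record<number, number> {
  const offsets: Record<number, number> = {};
  // The last "startxref" points at the most recent xref section.
  const marker = pdf.lastIndexOf("startxref");
  if (marker < 0) return offsets;
  const xrefStart = parseInt(pdf.slice(marker + "startxref".length), 10);
  if (isNaN(xrefStart)) return offsets;
  const lines = pdf.slice(xrefStart).split(/\r\n|\r|\n/);
  // Anything other than a plain "xref" table is out of scope for this sketch.
  if (lines[0].trim() !== "xref") return offsets;
  let objNum = 0;
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line.indexOf("trailer") === 0) break;
    // Entry line: 10-digit offset, 5-digit generation, "n" (in use) or "f" (free).
    const entry = /^(\d{10}) (\d{5}) ([nf])$/.exec(line);
    if (entry) {
      if (entry[3] === "n") offsets[objNum] = parseInt(entry[1], 10);
      objNum++;
      continue;
    }
    // Subsection header: first object number, then the entry count.
    const header = /^(\d+) (\d+)$/.exec(line);
    if (header) objNum = parseInt(header[1], 10);
  }
  return offsets;
}
```

With a map like this, a page-copying routine could jump straight to the catalog, the page tree, and whatever those reference, leaving unrelated (and possibly invalid) objects unparsed.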
If this is something you'd be interested in working on, I'd be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you'd like to continue the discussion further!
@Hopding 0.6.4-rc1 fixes the issue on my end! 🎉 Should I close this thread?
I'll read up a bit on PDF structure, and open a new issue for it. Thank you so much for the active help!
Hi!
Thanks for a really welcome module.
I'm encountering thousands of different kinds of PDFs generated by other people, and got into some trouble with one specific one from Microsoft, getting the following error:
Incorrectly parsed object contents
These are the PDFs that I'm trying to combine. I think the offending one is the first, as it's the only one not generated by Puppeteer: Din_Microsoft-fakturaoversikt.pdf 3e63ebd0-8775-11e9-888e-1f95e38b402c.pdf
Presumably the PDF doesn't follow the standards, though there's little I can do about that.
My use case is to combine this PDF with a generated page that gives some info about it, for accounting purposes. As such, I don't really need to parse it any more than what's needed to append it to my PDF.
My code looks as follows: