Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

Incorrectly Parsed Object on Microsoft invoice PDF #119

Closed DanielJackson-Oslo closed 5 years ago

DanielJackson-Oslo commented 5 years ago

Hi!

Thanks for a really welcome module.

I'm processing thousands of different kinds of PDFs generated by other people, and ran into trouble with one specific PDF from Microsoft, which fails with the following error:

Incorrectly parsed object contents

These are the PDFs I'm trying to combine. I think the offending one is the first, as it's the only one not generated by Puppeteer: Din_Microsoft-fakturaoversikt.pdf 3e63ebd0-8775-11e9-888e-1f95e38b402c.pdf

Presumably the PDF doesn't follow the standards, though there's little I can do about that.

My use case is to combine this PDF with a generated page that gives some info about it, for accounting purposes. As such, I don't really need to parse it any more than what's needed to append it to my PDF.

My code looks as follows:

const fs = require('fs');
const { PDFDocumentFactory, PDFDocumentWriter } = require('pdf-lib');

// pdfsToMerge is an array of file paths; filePath is the output path.
// `logger` is the application's own logger (not shown here).
async function mergePdfs(pdfsToMerge, filePath) {
  const mergedPdf = PDFDocumentFactory.create();
  pdfsToMerge.forEach(pdfFilePath => {
    const pdf = fs.readFileSync(pdfFilePath);
    const pagesToMerge = PDFDocumentFactory.load(pdf).getPages();
    pagesToMerge.forEach(page => {
      mergedPdf.addPage(page);
    });
  });
  const mergedPdfFile = await PDFDocumentWriter.saveToBytes(mergedPdf);
  // writeFileSync is synchronous, so it needs no await
  fs.writeFileSync(filePath, mergedPdfFile);
  logger.verbose("Merged PDFs", { mergedPdfs: pdfsToMerge, filePath });
}

DanielJackson-Oslo commented 5 years ago

https://www.pdfen.com/pdf-a-validator gives no errors for the file.

https://www.datalogics.com/products/pdftools/pdf-checker/ gives the following output for it, suggesting that the only error is some missing fonts?

PDF Checker 1.4.1  Copyright 2018-2019 Datalogics, Inc. All Rights Reserved

Wed Jun  5 04:31:19 2019

JSON Profile: everything.json

Input Document: DinMicrosoft-fakturaoversikt.pdf

<<=CHECKER_SUMMARY_START=>>
fonts:uses-base14fonts-not-embedded
<<=CHECKER_SUMMARY_END=>>

General Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        claims-pdfa-conformance
        contains-owner-password
        contains-signature
        damaged
        password-protected
        pdf-v2
        unable-to-open
        xfa-type

Userdata Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        contains-annots
        contains-annots-not-for-printing
        contains-annots-not-for-viewing
        contains-annots-without-normal-appearances
        contains-embedded-files
        contains-metadata
        contains-optional-content
        contains-private-data
        contains-transparency

Fonts Results
    Errors:
        Uses Base 14 fonts not embedded in document: 
            Helvetica (1 instance)
            Helvetica-Bold (1 instance)
    Information:
        None
    Checks Completed:
        fontdescriptor-missing-capheight
        fontdescriptor-missing-fields
        uses-base14fonts-not-embedded
        uses-fonts-fully-embedded
        uses-fonts-not-embedded

Objects Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        contains-javascript-actions
        contains-thumbnails

Cleanup Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        suboptimal-compression

Image Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        alternate-images

    Color Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        image-depth
        resolution-too-high
        resolution-too-low
        uses-jpeg2000-compression

    Grayscale Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        resolution-too-high
        resolution-too-low
        uses-jpeg2000-compression

    Monochrome Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        resolution-too-high
        resolution-too-low
        uses-jbig2-compression

Hopding commented 5 years ago

Hello @DanielJackson-Oslo!

I ran the Din_Microsoft-fakturaoversikt.pdf file you shared through qpdf (a very useful PDF validation tool). It turns out the file is technically invalid:

$ qpdf --check ~/Din_Microsoft-fakturaoversikt.pdf
checking ~/Din_Microsoft-fakturaoversikt.pdf
PDF Version: 1.3
File is not encrypted
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): stream keyword not followed by proper line terminator
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 1971): expected endstream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): attempting to recover stream length
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): recovered stream length: 1854
File is not linearized
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): stream keyword not followed by proper line terminator
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 15835): expected endstream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): attempting to recover stream length
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): recovered stream length: 1514
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (offset 121): error decoding stream data for object 6 0: stream inflate: inflate: data: incorrect header check
page 1: content stream (content stream object 6 0): errors while decoding content stream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (offset 14325): error decoding stream data for object 9 0: stream inflate: inflate: data: incorrect header check
page 2: content stream (content stream object 9 0): errors while decoding content stream

Two of the stream objects contained in this file are corrupt. This is why pdf-lib throws an error when trying to parse it.
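
To make the first kind of warning concrete: the PDF specification requires the `stream` keyword to be followed by either CRLF or a lone LF. The following is only a rough sketch of how such a check could look, not what qpdf or pdf-lib actually do. The function name `findBadStreamKeywords` is hypothetical, and the scan is a heuristic (binary stream data could contain the word "stream" by chance).

```javascript
// Heuristic sketch (hypothetical helper, not part of pdf-lib or qpdf):
// report byte offsets where a `stream` keyword is not followed by CRLF or LF.
function findBadStreamKeywords(bytes) {
  // latin1 maps every byte to exactly one character, so string indexes
  // are also byte offsets.
  const text = Buffer.from(bytes).toString('latin1');
  const badOffsets = [];
  let index = text.indexOf('stream');
  while (index !== -1) {
    // Ignore the `stream` that is part of an `endstream` keyword.
    const isEndstream = text.slice(Math.max(0, index - 3), index) === 'end';
    if (!isEndstream) {
      const after = text.slice(index + 6, index + 8);
      const ok = after.startsWith('\r\n') || after.startsWith('\n');
      if (!ok) badOffsets.push(index);
    }
    index = text.indexOf('stream', index + 6);
  }
  return badOffsets;
}
```

Calling `findBadStreamKeywords(fs.readFileSync(path))` on a file returns an empty array when every `stream` keyword it finds is terminated properly.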

That being said, I think it would be possible to adapt pdf-lib's parser to tolerate these specific stream errors. I'll look into this and get back to you.
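
As an illustration of what "tolerating" such a stream might involve, here is a minimal sketch that recovers a stream's length by searching for the next `endstream` keyword, similar in spirit to the "recovered stream length" lines in the qpdf output. It is an assumption about one possible approach, not the actual pdf-lib fix. The name `recoverStreamLength` and its parameters are hypothetical.

```javascript
// Sketch (hypothetical helper): given the byte offset of a `stream` keyword,
// locate the stream data by scanning forward to the next `endstream`.
function recoverStreamLength(bytes, streamKeywordOffset) {
  const buf = Buffer.from(bytes);
  let dataStart = streamKeywordOffset + 'stream'.length;
  // Skip the terminator after `stream`: CRLF, LF, or (leniently) a lone CR.
  // Accepting a lone CR is a guess that may eat one byte of real data.
  if (buf[dataStart] === 0x0d) dataStart += 1;
  if (buf[dataStart] === 0x0a) dataStart += 1;

  const endOffset = buf.indexOf('endstream', dataStart, 'latin1');
  if (endOffset === -1) return null; // no endstream keyword found

  // Drop one end-of-line marker directly before `endstream`, if present.
  let dataEnd = endOffset;
  if (dataEnd > dataStart && buf[dataEnd - 1] === 0x0a) dataEnd -= 1;
  if (dataEnd > dataStart && buf[dataEnd - 1] === 0x0d) dataEnd -= 1;

  return { dataStart, length: dataEnd - dataStart };
}
```

The recovered range can then be sliced out with `buf.slice(dataStart, dataStart + length)`. Whether those bytes decode correctly is a separate question, as the "incorrect header check" warnings above show.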

Hopding commented 5 years ago

I just cut version 0.6.4-rc1 of pdf-lib. It contains a fix for this issue.

You can install this prerelease with npm:

npm install pdf-lib@0.6.4-rc1

It's also available on unpkg:

Please try it out and let me know if it works for you!

Hopding commented 5 years ago

@DanielJackson-Oslo I'd like to add the Din_Microsoft-fakturaoversikt.pdf file you shared to the pdf-lib GitHub repo to create a regression test for this issue. Do you mind? Does the file contain any sensitive information?

It looks like it might be a test billing statement? But I can't tell for sure since it's not written in English.

DanielJackson-Oslo commented 5 years ago

> @DanielJackson-Oslo I'd like to add the Din_Microsoft-fakturaoversikt.pdf file you shared to the pdf-lib GitHub repo to create a regression test for this issue. Do you mind? Does the file contain any sensitive information?

@Hopding Feel free to use it! It's a bill for my own Office 365, presumably the same one they generate for all customers.

Thanks for the quick follow-up. Looking forward to the 0.6.4 release. How stable is the RC?

DanielJackson-Oslo commented 5 years ago

@Hopding Since this isn't the first time this sort of problem has come up, I'd imagine there are hundreds, if not thousands, of different ways that PDFs can be malformed but still render in most PDF readers, and thus exist in the wild.

I don't know much about the technical nature of PDFs, but for my use case, I'd really only need pdf-lib to recognize where the PDF pages start, and then copy those into a new PDF without further validating them. (All I want to do is merge two PDFs; I don't need any control over or understanding of the contents.)

I see that there's a "copy" function in the library. Is that what it does? If not, could I somehow help write a "merge blindly" function?

Hopding commented 5 years ago

@DanielJackson-Oslo The RC should be perfectly stable. The only change it includes is the fix for this issue. And of course, it passed all the unit and integration tests before I cut it. So if it's working well for you, then there shouldn't be anything to worry about. (I always cut RCs for every release, no matter how trivial the changes).

It would certainly be possible to get away with less object parsing (and therefore tolerate more invalid objects) if you just want to copy pages. However, in order to find and copy the page objects (and any other objects they reference) it is still necessary to parse some objects.

Implementing this sort of "lazy parsing" would take more than just writing a function, though. It would be necessary to modify some of pdf-lib's parsing code. The parser currently scans input PDFs from start to finish, parsing each object it encounters along the way.
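
For a sense of where such lazy parsing could begin: a PDF ends with a `startxref` keyword followed by the byte offset of its cross-reference section, which in turn records where each object lives. A parser could start there instead of scanning from the top. The snippet below is only a sketch of that first step, under the assumption that the keyword sits within the last 1024 bytes. The name `findStartXrefOffset` is hypothetical and is not part of pdf-lib.

```javascript
// Sketch (hypothetical helper): read the cross-reference offset from the
// end of a PDF file instead of scanning the whole file.
function findStartXrefOffset(bytes) {
  const buf = Buffer.from(bytes);
  // Assumption: `startxref` appears in the final 1024 bytes of the file.
  const tail = buf.slice(Math.max(0, buf.length - 1024)).toString('latin1');
  const keywordIndex = tail.lastIndexOf('startxref');
  if (keywordIndex === -1) return null;
  // The keyword is followed by whitespace and a decimal byte offset.
  const match = /^startxref\s+(\d+)/.exec(tail.slice(keywordIndex));
  return match ? Number(match[1]) : null;
}
```

From that offset a lazy parser would still have to read the cross-reference data, the trailer, and the page tree, so this covers only the entry point.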

If this is something you'd be interested in working on, I'd be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you'd like to continue the discussion further!

DanielJackson-Oslo commented 5 years ago

@Hopding 0.6.4-rc1 fixes the issue on my end! 🎉 Should I close this thread?

> If this is something you'd be interested in working on, I'd be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you'd like to continue the discussion further!

I'll read up a bit on PDF structure, and open a new issue for it. Thank you so much for the active help!