Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

Corrupt PDF issue #44

Closed digi-chris closed 5 years ago

digi-chris commented 5 years ago

Hi,

I've got a complex PDF file that doesn't save correctly when run through pdf-lib - although it does load OK and renders with other libraries. If I simply load the file and then save it again through pdf-lib, the resultant file doesn't render correctly in Chrome's PDF viewer (it seems to stop drawing before everything is on-screen), and if I open it in Adobe Reader, I get the following error:

The font 'UZCHSP+Helvetica' contains a bad /BBox.

The PDF does contain the Helvetica font, but it doesn't contain a font called 'UZCHSP+Helvetica', so I'm not sure where this is coming from.

If I open the file in Adobe Acrobat and resave it, the resultant file is a bit smaller and then pdf-lib processes it just fine.

Unfortunately, I'm not 100% sure if I can share the PDF file publicly as it is copyrighted, so currently I'm just looking for any pointers or tips as to where I could look to find the problem - and I'll share my results here.

I was thinking maybe if I could find a list of fonts used in the PDF via pdf-lib, I could then also look up the bounding boxes for the font glyphs and that might shed some light on the matter. But, I can't work out how to see the fonts that are held in the file. Any ideas?

Thanks,

Chris.
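On the question of listing fonts: a name such as 'UZCHSP+Helvetica' has the shape of a font subset name, which the PDF specification defines as a tag of six uppercase letters, a plus sign, and then the base font name. One rough way to see which font names a file declares, without relying on any pdf-lib API, is to scan the raw file text for /BaseFont entries. The sketch below assumes the font dictionaries are stored uncompressed (likely for a PDF 1.3 file, which cannot use object streams); `listBaseFonts` is a hypothetical helper name, not part of pdf-lib.

```javascript
// Hypothetical helper (not a pdf-lib API): collect the distinct /BaseFont
// names that appear in the uncompressed text of a PDF file.
function listBaseFonts(pdfText) {
  // A name runs until whitespace or a PDF delimiter character.
  const pattern = /\/BaseFont\s*\/([^\s\/<>\[\](){}%]+)/g;
  const fonts = new Map();
  let match;
  while ((match = pattern.exec(pdfText)) !== null) {
    const name = match[1];
    // Subset fonts are named "ABCDEF+BaseName": six uppercase letters and a plus sign.
    const subset = /^([A-Z]{6})\+(.+)$/.exec(name);
    fonts.set(name, {
      name,
      subsetTag: subset ? subset[1] : null,
      baseName: subset ? subset[2] : name,
    });
  }
  return Array.from(fonts.values());
}
```

It could be called as `listBaseFonts(fs.readFileSync("input.pdf", "latin1"))`; reading as latin1 keeps one character per byte. Fonts declared inside compressed object streams would not show up in such a scan.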

digi-chris commented 5 years ago

Just looking a bit further at this, I used PDFAssembler to open the original file, and the resultant file from pdf-lib.

PDFAssembler gives me the PDF structure, and I found the UZCHSP+Helvetica on the following path:

/Root -> /Pages -> /Kids[0] -> /Resources -> /Font -> /R8

On the original, there is a /FontDescriptor object. On the version saved from pdf-lib, that object is now null, meaning the /FontBBox is missing. I guess this is the problem - I just don't know how it's going missing.

Chris.

Hopding commented 5 years ago

Hello @digi-chris.

digi-chris commented 5 years ago

Hi, thanks for the reply!

The file isn't that large - 415kb, and only one page.

I'm not doing anything else with it, just opening the file with pdf-lib and saving it again, using the latest version (0.4.0?).

I've tried saving it without object streams - the resultant file is slightly bigger in size, but it still has the same problem.

I've found that if I load the file in Adobe Acrobat and save it again, then it works fine in pdf-lib afterwards. So, it seems clear to me that there is something unusual about the PDF. But, the original file loads OK in everything else I've tried (Chrome, Edge, PDF.JS, PDFAssembler, Adobe Reader).

Hopding commented 5 years ago

Interesting. I can't think of anything obvious that would cause this. It'll be a bit difficult to figure this out without direct access to the document. But there are a few things you can try that will give me some more info to work with:

Hopding commented 5 years ago

Can you also please share the script that you are using to open and save the document with me? (Or if you can't share the actual script, write up a minimal working example). It's possible your code is feeding the document incorrectly to pdf-lib.

digi-chris commented 5 years ago

Thanks again, here's the basic script I'm using. I cut it down to literally just opening and saving the file to make sure there weren't any other problems:

// Minimal reproduction: load an existing PDF and save it again unchanged.
const fs = require("fs");
const { PDFDocumentFactory, PDFDocumentWriter } = require("pdf-lib");

// Read the input PDF from disk and parse it.
const existingPdfDocBytes = fs.readFileSync("input.pdf");
const pdfDoc = PDFDocumentFactory.load(existingPdfDocBytes);

// Serialize without object streams, then write the result to disk.
const pdfBytes = PDFDocumentWriter.saveToBytes(pdfDoc, { useObjectStreams: false });
fs.writeFileSync("pdfout.pdf", pdfBytes);
console.log("Done.");

I also ran the PDF through the validator - it comes back stating:

The document does conform to the PDF 1.3 standard.

I'll try switching to 0.3.0 and using qpdf next :)

Thanks for the help on this. I might be able to send the file directly to you, but I'll do these other tests first.

Hopding commented 5 years ago

What version of Node are you using? If it's below 7.4.0, you'll have to use the Buffer.from method when saving the bytes to a file (see: https://github.com/Hopding/pdf-lib/issues/16#issuecomment-405236116). (I suspect this is not your problem though, because from your description, it sounds like the PDF partially loads?)
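For readers on an older Node version, the workaround mentioned above can be sketched as follows. It assumes the saved bytes are a Uint8Array; `writePdfBytes` is a hypothetical helper name used only for illustration.

```javascript
const fs = require("fs");

// Hypothetical helper: wrap the Uint8Array in a Buffer before writing, so the
// write also works on Node versions whose fs functions do not accept a
// Uint8Array directly.
function writePdfBytes(filePath, pdfBytes) {
  fs.writeFileSync(filePath, Buffer.from(pdfBytes));
}
```

`Buffer.from(pdfBytes)` copies the data, which is normally fine for a one-off write.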

digi-chris commented 5 years ago

I'm using version 8.12.0 - yes, the PDF does partially load. After loading and saving through pdf-lib, it seems like only some of the font data has disappeared - the file still actually loads in Adobe Reader, and displays all the text, it just gives me that error I mentioned in the first post.

Other PDFs load and save fine, but I have a whole batch of PDFs that I think must have been saved with the same software (not sure what yet) that display this weird error.

I've tried checking the file with qpdf now, and it also seems to work - I get the following output:

PDF Version: 1.3
File is not encrypted
File is not linearized
No syntax or stream encoding errors found; the file may still contain
errors that qpdf cannot detect

I've also rolled back to 0.3.0, and the issue still exists. Really strange!

Hopding commented 5 years ago

Hello @digi-chris. Sorry for not having responded these past few days, I've been pretty busy. But I haven't forgotten this issue, and I'd still very much like to get to the bottom of this and fix it.

Is the output you shared from qpdf the result of running --check on the initial version of the document, or the version saved by pdf-lib? Whichever it is, can you please share the results of checking both?

Can you please try checking which software produced (and last modified) the PDF file? You should be able to find this as metadata of the document (note that not all documents will contain this information).

Also, if you are able to send me a sample PDF that I can use to reproduce the issue, that would help immensely (you mentioned this as a possibility in a previous comment). Feel free to email it directly to me (andrew.dillon.j@gmail.com) if you're unable to share it publicly.

digi-chris commented 5 years ago

Hi @Hopding, no worries, I've been pretty busy as well!

The qpdf check was on the original file - I've now run it again on the version saved from pdf-lib and I get the following output:

WARNING: pdfout.pdf: reported number of objects (16) inconsistent with actual number of objects (25)
checking pdfout.pdf
PDF Version: 1.7
File is not encrypted
File is not linearized

This seems to tie up with my results through PDFAssembler, which showed that the font information has gone missing. Interestingly, the metadata states that the 'PDF Producer' was GPL Ghostscript 9.06.

Thanks for the contact address, I'll send you over one of the files I'm having trouble with :)

Hopding commented 5 years ago

@digi-chris I finally had a chance to take a look at the PDF you shared with me. After a bit of debugging, I discovered that the PDF contains some comments. Like qpdf --check states, the modified PDF output by pdf-lib is missing some objects. I discovered that each missing object is preceded by a comment in the original PDF. If I manually remove the comments from the original PDF in a text editor, then pdf-lib can parse & save the PDF without any errors.
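To check whether another file has the same pattern, one option is to scan its raw text for object definitions that directly follow a comment line. This is only a rough sketch: `findObjectsAfterComments` is a hypothetical name, the scan does not distinguish a real comment from a percent sign inside a string or stream, and it simply skips the comment lines at the very start of the file (the %PDF header and the binary marker that often follows it).

```javascript
// Hypothetical diagnostic (not part of pdf-lib): list the objects whose
// "N G obj" header comes right after a comment line.
function findObjectsAfterComments(pdfText) {
  // Skip the leading comment lines (the %PDF-n.m header and any binary marker).
  const header = /^(?:%[^\r\n]*(?:\r\n|\r|\n))*/.exec(pdfText)[0];
  const body = pdfText.slice(header.length);

  // A comment running to end of line, then optional whitespace, then "N G obj".
  const pattern = /%([^\r\n]*)(?:\r\n|\r|\n)\s*(\d+)\s+(\d+)\s+obj\b/g;
  const hits = [];
  let match;
  while ((match = pattern.exec(body)) !== null) {
    hits.push({
      objectNumber: Number(match[2]),
      generation: Number(match[3]),
      comment: match[1].trim(),
    });
  }
  return hits;
}
```

Run against the original file (read with the "latin1" encoding), the reported object numbers could then be compared with the objects that go missing after saving.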

Comments are part of the PDF standard:

7.2.3 Comments

Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORIZONTAL TAB characters (09h). A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.

EXAMPLE
The PDF fragment in this example is syntactically equivalent to just the tokens abc and 123.

    abc% comment ( /%) blah blah blah 
    123

Comments (other than the %PDF–n.m and %%EOF comments described in 7.5, "File Structure") have no semantics. They are not necessarily preserved by applications that edit PDF files.

When implementing pdf-lib's parser, I made a note to strip out comments when parsing. My initial implementation actually interfered with stream parsing. Since I'd never seen a real-life PDF that contained comments at that point in time, I decided to implement the parser without stripping comments.

But I've now come across a real PDF document that does contain comments :wink:. So I guess now is the time to update the parser to strip out comments when parsing.
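As an illustration of the rule quoted from the spec, here is a minimal sketch of comment stripping that avoids the stream problem by copying literal strings and stream bodies through untouched. It is not pdf-lib's actual implementation; `stripPdfComments` is a hypothetical name, it works on a string rather than raw bytes, it treats the %PDF header and %%EOF like any other comment, and it finds the end of a stream by searching for the endstream keyword where a real parser would use the stream's /Length entry.

```javascript
// Illustrative sketch only (not pdf-lib's parser): replace each comment with a
// single space, leaving literal strings and stream bodies untouched.
function stripPdfComments(src) {
  let out = "";
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "(") {
      // Literal string: copy up to the matching ")", honoring nesting and escapes.
      let depth = 0;
      do {
        const c = src[i];
        if (c === "\\") {
          // Copy the backslash and the escaped character together.
          out += src.slice(i, i + 2);
          i += 2;
          continue;
        }
        if (c === "(") depth++;
        else if (c === ")") depth--;
        out += c;
        i++;
      } while (depth > 0 && i < src.length);
    } else if (src.startsWith("stream", i) && /[\r\n]/.test(src.charAt(i + 6))) {
      // Stream body: copy verbatim through the "endstream" keyword.
      const end = src.indexOf("endstream", i);
      const stop = end === -1 ? src.length : end + "endstream".length;
      out += src.slice(i, stop);
      i = stop;
    } else if (ch === "%") {
      // Comment: skip to (but not past) the end of the line, emit one space.
      while (i < src.length && src[i] !== "\r" && src[i] !== "\n") i++;
      out += " ";
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}
```

Applied to the spec's example fragment, the comment collapses to one space, so the remaining tokens are still abc and 123, while a percent sign inside a string such as (50% off) is left alone.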

Hopding commented 5 years ago

@digi-chris I just published a new pre-release: 0.4.1-rc1. It contains a fix for this issue. Please try it out when you get a chance and let me know if it works for you.

You can install this pre-release with npm:

npm install pdf-lib@0.4.1-rc1

It's also available on unpkg:

(This is the branch for the fix, if you're interested: https://github.com/Hopding/pdf-lib/tree/IgnoreCommentsWhenParsing)

digi-chris commented 5 years ago

@Hopding this is fantastic - it seems to be working really well. I tested with the failing PDF files and they load and save fine now, so I added some code to place an image on the page as well, so that I was definitely making changes. Again, this worked perfectly.

Thanks so much for your help on this!

Hopding commented 5 years ago

Fantastic! That's great to hear @digi-chris. I'll go ahead and cut an official 0.4.1 release with this fix today.

Hopding commented 5 years ago

Version 0.4.1 is now published. It contains the parser bug fixes for this issue. The full release notes are available here.

You can install this new version with npm:

npm install pdf-lib@0.4.1

It's also available on unpkg: