Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

Writing text to specific pdf seems to break the structure #78

Closed kevinswartz closed 5 years ago

kevinswartz commented 5 years ago

Hi @Hopding, I have a file here that I'm able to view without issue in pdf.js. Once I write some text to it via pdf-lib, the file can no longer be viewed in pdf.js and fails with the error "Invalid PDF Structure". I've attached PDFs from before and after the write. Do you have any ideas about ways to write the text differently so this doesn't happen? These files are non-production. Thanks again! file_before.pdf file_after.pdf

kevinswartz commented 5 years ago

I have another file with the same problem (I think). Attaching it here! before.pdf after.pdf

Edit: I was using v0.6.1-rc4 when I generated this file

Hopding commented 5 years ago

Hello @kevinswartz.

I took a look at this today. Something about the source document seems to be causing pdf-lib to miscalculate the offsets for the cross-reference table. This is likely a bug in the PDFDocumentWriter, which means the problem will arise just by opening and saving the document with pdf-lib - whether you make any modifications or not.

You can sort of work around the problem by saving the document without object streams:

// With Object Streams
PDFDocumentWriter.saveToBytes(pdfDoc);

// Without Object Streams
PDFDocumentWriter.saveToBytes(pdfDoc, { useObjectStreams: false });

Acrobat was able to open the documents you shared after saving with useObjectStreams: false.

Of course, this doesn't actually fix the bug. So I'll continue looking into this and let you know what I find.

kevinswartz commented 5 years ago

Thanks @Hopding, I can confirm that this fixes what I was seeing with both of these files. Are there any other consequences of saving with useObjectStreams: false? What is that really doing? Thanks!

Hopding commented 5 years ago

@kevinswartz The only real benefit to using object streams is that it makes the resulting PDF file a bit smaller. Many PDF libraries don't support object streams at all, and only write PDFs without them.

PDF files contain a structure known as a Cross Reference Table (since PDF v1.0). This table contains pointers (byte offsets) to each object in the document, which allows for fast random access to objects in large PDF files. These tables tend to get corrupted a lot, so most readers are able to reconstruct them without any perceptible change in the reader's performance.
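To make the byte-offset idea concrete, here is a minimal sketch of writing a few objects followed by a classic cross-reference table. It is not pdf-lib's writer: `writeWithXrefTable` and its `bodies` parameter are hypothetical names, and the sketch assumes ASCII object bodies, generation number 0, and objects numbered 1..N with no gaps (an assumption real documents need not satisfy).

```javascript
// Sketch only: serialize objects and record the byte offset of each one.
// Assumes objects are numbered 1..N with no gaps and generation 0.
function writeWithXrefTable(bodies) {
  const enc = new TextEncoder();
  // Offsets are byte positions, so measure the encoded length.
  const byteLength = (s) => enc.encode(s).length;

  let pdf = '%PDF-1.4\n';
  const offsets = [];

  bodies.forEach((body, i) => {
    offsets.push(byteLength(pdf)); // where object i + 1 starts
    pdf += `${i + 1} 0 obj\n${body}\nendobj\n`;
  });

  const xrefOffset = byteLength(pdf);
  const pad = (n, width) => String(n).padStart(width, '0');

  // Entry 0 is the head of the free list. Every entry is exactly 20 bytes:
  // 10-digit offset, space, 5-digit generation, space, type, space, newline.
  let xref = `xref\n0 ${bodies.length + 1}\n0000000000 65535 f \n`;
  for (const offset of offsets) {
    xref += `${pad(offset, 10)} ${pad(0, 5)} n \n`;
  }

  // startxref tells the reader where the table itself begins.
  const trailer =
    `trailer\n<< /Size ${bodies.length + 1} /Root 1 0 R >>\n` +
    `startxref\n${xrefOffset}\n%%EOF\n`;

  return { pdf: pdf + xref + trailer, offsets, xrefOffset };
}
```

If any recorded offset is off by even one byte, a strict reader that jumps straight to that position will not find the `N 0 obj` header it expects, which is the kind of failure described in this thread.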

However, if the file is saved with object streams, then Cross Reference Streams are used instead of Cross Reference Tables. Cross Reference Streams were introduced in a later PDF version (v1.5). For whatever reason, not as many readers are able to reconstruct corrupted Cross Reference Streams (e.g. Google Chrome can, but Mac's Preview and Adobe Acrobat apparently cannot).
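As a rough illustration of what "reconstructing" can mean, the sketch below rebuilds object offsets by scanning the raw bytes for `N G obj` headers. This is only an assumption about how a tolerant reader might approach it, not the behavior of any particular reader, and `scanObjectOffsets` is a hypothetical name.

```javascript
// Sketch only: recover object offsets by scanning for "N G obj" headers.
function scanObjectOffsets(bytes) {
  // Map each byte to one character so string indices equal byte offsets.
  const text = Array.from(bytes, (b) => String.fromCharCode(b)).join('');

  // An object header starts a line: object number, generation, "obj".
  const header = /(?:^|[\r\n])(\d+) (\d+) obj\b/g;
  const found = new Map();

  let m;
  while ((m = header.exec(text)) !== null) {
    const headerLength = `${m[1]} ${m[2]} obj`.length;
    // Skip the leading line break (if any) that the pattern consumed.
    const offset = m.index + m[0].length - headerLength;
    // If an object number appears twice, the later definition wins.
    found.set(Number(m[1]), { offset, gen: Number(m[2]) });
  }
  return found;
}
```

One plausible reason this kind of recovery works less well with object streams: objects stored inside an object stream have no `obj` header of their own and are usually compressed, so a plain byte scan like this cannot see them.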

kevinswartz commented 5 years ago

Thanks @Hopding! Good information. We might stop using object streams if it means better compatibility.

Hopding commented 5 years ago

Hello there @kevinswartz!

I was able to find and fix the issue causing this in https://github.com/Hopding/pdf-lib/pull/101. Some of the logic used to write out the cross reference tables and streams was incorrect. In particular, the code assumed that all PDFs would have an object with an ID of 1. This resulted in offset miscalculations if, in fact, no such object existed.
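For anyone curious about the general idea, here is a small sketch of how a writer can avoid assuming that any particular object number exists: group the object numbers that are actually present into contiguous runs, and emit one cross-reference subsection per run. This is a hypothetical illustration (`toXrefSubsections` is a made-up name), not the code from the pull request.

```javascript
// Sketch only: group object numbers into contiguous xref subsections.
// Each subsection is described by its first object number and a count.
function toXrefSubsections(objectNumbers) {
  const sorted = [...new Set(objectNumbers)].sort((a, b) => a - b);
  const sections = [];

  for (const num of sorted) {
    const last = sections[sections.length - 1];
    if (last && num === last.first + last.count) {
      last.count += 1; // extends the current run
    } else {
      sections.push({ first: num, count: 1 }); // gap found, start a new run
    }
  }
  return sections;
}

// Object 0 is the free-list head; this document has no object 1.
const sections = toXrefSubsections([0, 2, 3, 4, 7, 8]);
// sections: [ { first: 0, count: 1 }, { first: 2, count: 3 }, { first: 7, count: 2 } ]
```

Each run then becomes a `first count` header line in a cross-reference table, or a pair of numbers in the `/Index` array of a cross-reference stream, so the offsets that follow line up with the object numbers they belong to.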

I just cut prerelease 0.6.2-rc3 with the fix.

You can install this prerelease with npm:

npm install pdf-lib@0.6.2-rc3

It's also available on unpkg:

Please try it out and let me know if it works for you!

kevinswartz commented 5 years ago

Thanks! I'll check it out.

Hopding commented 5 years ago

Version 0.6.2 is now published. It contains a fix for this issue. The full release notes are available here.

You can install this new version with npm:

npm install pdf-lib@0.6.2

It's also available on unpkg:

(@kevinswartz if you find that you're still having trouble with this after using the new release, please go ahead and reopen this issue.)

kevinswartz commented 5 years ago

Thanks @Hopding! Looks like that fixes the issue.