Closed digi-chris closed 5 years ago
Just looking a bit further at this, I used PDFAssembler to open the original file, and the resultant file from pdf-lib.
PDFAssembler gives me the PDF structure, and I found the UZCHSP+Helvetica on the following path:
/Root -> /Pages -> /Kids[0] -> /Resources -> /Font -> /R8
On the original, there is a /FontDescriptor object. On the version saved from pdf-lib, that object is now null, meaning the /FontBBox is missing. I guess this is the problem - I just don't know how it's going missing.
Chris.
Hello @digi-chris. What version of pdf-lib are you using?
Can you please try saving the PDF without object streams? You can do this by setting the useObjectStreams: false option of the PDFDocumentWriter.saveToBytes method:
const pdfDoc = PDFDocumentFactory.load(/* a Uint8Array */);
const savedBytes = PDFDocumentWriter.saveToBytes(pdfDoc, { useObjectStreams: false });
Hi, thanks for the reply!
The file isn't that large - 415kb, and only one page.
I'm not doing anything else with it, just opening the file with pdf-lib and saving it again, using the latest version (0.4.0?).
I've tried saving it without object streams - the resultant file is slightly bigger in size, but it still has the same problem.
I've found that if I load the file in Adobe Acrobat and save it again, then it works fine in pdf-lib afterwards. So, it seems clear to me that there is something unusual about the PDF. But, the original file loads OK in everything else I've tried (Chrome, Edge, PDF.JS, PDFAssembler, Adobe Reader).
Interesting. I can't think of anything obvious that would cause this. It'll be a bit difficult to figure this out without direct access to the document. But there are a few things you can try that will give me some more info to work with:
Please try installing version 0.3.0 of pdf-lib and see if you have the same problem. This way we can check if the issue is caused by a regression.
Please also try installing qpdf and checking the validity of the document that way. You should be able to run it on Windows, Linux, or Mac. If you have a Mac, you can install and run qpdf like so:
brew install qpdf
qpdf --check path-to-your-file.pdf
Can you also please share the script that you are using to open and save the document with me? (Or if you can't share the actual script, write up a minimal working example.) It's possible your code is feeding the document incorrectly to pdf-lib.
Thanks again, here's the basic script I'm using. I cut it down to literally just opening and saving the file to make sure there weren't any other problems:
var fs = require("fs");
var pdflib = require("pdf-lib");
var PDFDocumentFactory = pdflib.PDFDocumentFactory;
var PDFDocumentWriter = pdflib.PDFDocumentWriter;
const existingPdfDocBytes = fs.readFileSync("input.pdf");
const pdfDoc = PDFDocumentFactory.load(existingPdfDocBytes);
const pdfBytes = PDFDocumentWriter.saveToBytes(pdfDoc, { useObjectStreams: false });
fs.writeFileSync("pdfout.pdf", pdfBytes);
console.log("Done.");
I also ran the PDF through the validator - it comes back stating:
The document does conform to the PDF 1.3 standard.
I'll try switching to 0.3.0 and using qpdf next :)
Thanks for the help on this. I might be able to send the file directly to you, but I'll do these other tests first.
What version of Node are you using? If it's below 7.4.0, you'll have to use the Buffer.from method when saving the bytes to a file (see: https://github.com/Hopding/pdf-lib/issues/16#issuecomment-405236116). (I suspect this is not your problem though, because from your description, it sounds like the PDF partially loads?)
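For reference, here is a minimal sketch of the Buffer.from approach mentioned above. The pdfBytes value is a hypothetical stand-in for the Uint8Array returned by PDFDocumentWriter.saveToBytes, and the output path is only for illustration:

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical stand-in for the Uint8Array returned by
// PDFDocumentWriter.saveToBytes(pdfDoc); here just the bytes of "%PDF".
const pdfBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]);

// Copy the typed array into a Buffer before writing, so that Node versions
// which do not handle a plain Uint8Array in fs.writeFileSync still write
// the raw bytes.
const outPath = path.join(os.tmpdir(), "pdfout-example.pdf");
fs.writeFileSync(outPath, Buffer.from(pdfBytes));

// Read the file back to confirm the bytes round-trip unchanged.
const written = fs.readFileSync(outPath);
console.log(written.length === pdfBytes.length);
```

On newer Node versions the wrapper should be harmless, so it can be left in place regardless of the runtime.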
I'm using version 8.12.0 - yes, the PDF does partially load. After loading and saving through pdf-lib, it seems like only some of the font data has disappeared. The file still loads in Adobe Reader and displays all the text; it just gives me the error I mentioned in the first post.
Other PDFs load and save fine, but I have a whole batch of PDFs that I think must have been saved with the same software (not sure what yet) that display this weird error.
I've tried checking the file with qpdf now, and it also seems to work - I get the following output:
PDF Version: 1.3
File is not encrypted
File is not linearized
No syntax or stream encoding errors found; the file may still contain
errors that qpdf cannot detect
I've also rolled back to 0.3.0, and the issue still exists. Really strange!
Hello @digi-chris. Sorry for not having responded these past few days, I've been pretty busy. But I haven't forgotten this issue, and I'd still very much like to get to the bottom of this and fix it.
Is the output you shared from qpdf the result of running --check on the initial version of the document, or the version saved by pdf-lib? Whichever it is, can you please share the results of checking both?
Can you please try checking which software produced (and last modified) the PDF file? You should be able to find this as metadata of the document (note that not all documents will contain this information).
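As a rough illustration, the producer can sometimes be read straight from the raw bytes. The sketch below uses a hypothetical helper, findProducer, and assumes the document information dictionary is stored uncompressed with /Producer written as a simple literal string; it will find nothing if the entry sits in an object stream, is hex-encoded, or is absent:

```javascript
// Hypothetical helper (not part of pdf-lib): search the raw bytes of a PDF
// for a /Producer entry written as a literal string, e.g. /Producer (Foo 1.0).
function findProducer(pdfBytes) {
  // latin1 maps every byte to exactly one character, so binary data is
  // carried through without being altered by the decoding step.
  const text = Buffer.from(pdfBytes).toString("latin1");
  // Match the string body, allowing backslash escapes but not nested
  // (unescaped) parentheses.
  const match = /\/Producer\s*\(((?:\\.|[^\\()])*)\)/.exec(text);
  return match ? match[1] : null;
}

// Usage with a synthetic fragment rather than a real file:
const sample = Buffer.from("<< /Creator (example) /Producer (Example Producer 1.0) >>", "latin1");
console.log(findProducer(sample));
```

A PDF viewer's document properties dialog reports the same field and is the more reliable route when one is available.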
Also, if you are able to send me a sample PDF that I can use to reproduce the issue, that would help immensely (you mentioned this as a possibility in a previous comment). Feel free to email it directly to me (andrew.dillon.j@gmail.com) if you're unable to share it publicly.
Hi @Hopding, no worries, I've been pretty busy as well!
The qpdf check was on the original file - I've now run it again on the version saved from pdf-lib and I get the following output:
WARNING: pdfout.pdf: reported number of objects (16) inconsistent with actual number of objects (25)
checking pdfout.pdf
PDF Version: 1.7
File is not encrypted
File is not linearized
This seems to tie up with my results through PDFAssembler, which showed that the font information has gone missing. Interestingly, the metadata states that the 'PDF Producer' was GPL Ghostscript 9.06.
Thanks for the contact address, I'll send you over one of the files I'm having trouble with :)
@digi-chris I finally had a chance to take a look at the PDF you shared with me. After a bit of debugging, I discovered that the PDF contains some comments. Like qpdf --check states, the modified PDF output by pdf-lib is missing some objects. I discovered that each missing object is preceded by a comment in the original PDF. If I manually remove the comments from the original PDF in a text editor, then pdf-lib can parse & save the PDF without any errors.
Comments are part of the PDF standard:
7.2.3 Comments
Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORIZONTAL TAB characters (09h). A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.
EXAMPLE
The PDF fragment in this example is syntactically equivalent to just the tokens abc and 123.
abc% comment ( /%) blah blah blah
123
Comments (other than the %PDF–n.m and %%EOF comments described in 7.5, "File Structure") have no semantics. They are not necessarily preserved by applications that edit PDF files.
When implementing pdf-lib's parser, I made a note to strip out comments when parsing. My initial implementation actually interfered with stream parsing. Since I'd never seen a real-life PDF that contained comments at that point in time, I decided to implement the parser without stripping comments.
But I've now come across a real PDF document that does contain comments :wink:. So I guess now is the time to update the parser to strip out comments when parsing.
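To make the rule from section 7.2.3 concrete, here is a minimal sketch of comment stripping. It is not pdf-lib's actual implementation; stripComments is a hypothetical helper that assumes it is only given text outside of streams, and it does not special-case the %PDF header or %%EOF marker, which a real parser has to handle separately:

```javascript
// Hypothetical helper: replace each PDF comment with a single space,
// leaving literal strings "( ... )" untouched.
function stripComments(text) {
  let out = "";
  let depth = 0; // nesting depth of literal-string parentheses
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (depth > 0) {
      // Inside a literal string: copy everything, tracking nesting.
      out += ch;
      if (ch === "\\") {
        // Copy the escaped character verbatim so "\(" and "\)" are not counted.
        if (i + 1 < text.length) out += text[++i];
      } else if (ch === "(") {
        depth++;
      } else if (ch === ")") {
        depth--;
      }
    } else if (ch === "(") {
      depth = 1;
      out += ch;
    } else if (ch === "%") {
      // Skip up to, but not including, the end-of-line marker.
      while (i + 1 < text.length && text[i + 1] !== "\n" && text[i + 1] !== "\r") {
        i++;
      }
      out += " "; // a comment is treated as a single white-space character
    } else {
      out += ch;
    }
  }
  return out;
}

// The example fragment from the specification:
const stripped = stripComments("abc% comment ( /%) blah blah blah\n123");
const tokens = stripped.split(/\s+/).filter(Boolean);
console.log(tokens);
```

Tracking parenthesis depth is what keeps a PERCENT SIGN inside a literal string, such as (50% off), from being mistaken for the start of a comment. Stream contents need the same kind of protection, which is the interference described above.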
@digi-chris I just published a new pre-release: 0.4.1-rc1. It contains a fix for this issue. Please try it out when you get a chance and let me know if it works for you.
You can install this pre-release with npm:
npm install pdf-lib@0.4.1-rc1
It's also available on unpkg:
(This is the branch for the fix, if you're interested: https://github.com/Hopding/pdf-lib/tree/IgnoreCommentsWhenParsing)
@Hopding this is fantastic - it seems to be working really well. I tested with the failing PDF files and they load and save fine now, so I added some code to place an image on the page as well, so that I was definitely making changes. Again, this worked perfectly.
Thanks so much for your help on this!
Fantastic! That's great to hear @digi-chris. I'll go ahead and cut an official 0.4.1 release with this fix today.
Hi,
I've got a complex PDF file that doesn't save correctly when run through pdf-lib - although it does load OK and renders with other libraries. If I simply load the file and then save it again through pdf-lib, the resultant file doesn't render correctly in Chrome's PDF viewer (it seems to stop drawing before everything is on-screen), and if I open it in Adobe Reader, I get the following error:
The PDF does contain the Helvetica font, but it doesn't contain a font called 'UZCHSP+Helvetica', so I'm not sure where this is coming from.
If I open the file in Adobe Acrobat and resave it, the resultant file is a bit smaller and then pdf-lib processes it just fine.
Unfortunately, I'm not 100% sure if I can share the PDF file publicly as it is copyrighted, so currently I'm just looking for any pointers or tips as to where I could look to find the problem - and I'll share my results here.
I was thinking maybe if I could find a list of fonts used in the PDF via pdf-lib, I could then also look up the bounding boxes for the font glyphs and that might shed some light on the matter. But, I can't work out how to see the fonts that are held in the file. Any ideas?
Thanks,
Chris.