galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

PDF Troubleshooting - Unable to start parsing PDF file #417

Closed thmclellan closed 5 years ago

thmclellan commented 5 years ago

Thanks for this awesome library. For a small percentage of PDFs, the reader throws an "Unable to start parsing PDF file" error. (Node v10.12.0)

   const hummus = require('hummus'); 
   const pdfReader = hummus.createReader(pdfPath);
   const pageCount = pdfReader.getPagesCount();

The error is thrown on createReader, which I'm calling as shown in https://github.com/galkahana/HummusJS/wiki/Parsing.

This happened today with a customer-related file in PDF 1.5 format. I can't share the file publicly so just sent it to you @galkahana by email. The header looks okay at a glance.

Anyway, I'm at a loss for how to troubleshoot this further - I probably need to get more hands-on with the PDF spec. The file opens okay with PDF.js, Mac Preview, etc., but it was generated with an AutoCAD-type program, so maybe it's using some uncommon PDF features.

In case it helps, Adobe Acrobat Pro preflight inspector tool shows an internal structure as below. The file isn't optimized for fast web loading and doesn't seem to have any encryption or security features.

Anyway appreciate any insights or suggestions on how to troubleshoot further. Thanks

[image: Acrobat Pro preflight view of the file's internal structure]

thmclellan commented 5 years ago

Thanks to @galkahana for troubleshooting over email!

If it helps anyone else, this corrupt PDF was a case of the startxref offset not pointing to the cross-reference table (xref), and the xref itself being malformed.
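For anyone who wants to check for the first of those two problems before reaching for a repair tool, here is a rough sketch of a check in plain Node.js. It is not part of Hummus: checkStartXref is a hypothetical helper name, and the 2048-byte tail window is an assumption. It only tests whether the offset after the last startxref keyword lands on an "xref" keyword or on the start of an indirect object (which in PDF 1.5+ may be a cross-reference stream). It does not validate the contents of the xref itself, so a file can pass this check and still be broken.

```javascript
// Hypothetical diagnostic helper (not a Hummus API).
// Takes the whole PDF as a Buffer and reports where the last `startxref` points.
function checkStartXref(buf) {
  // The trailer sits at the end of the file; 2048 bytes is an assumed search window.
  // latin1 maps one byte to one character, so string indices match byte offsets.
  const tail = buf.subarray(Math.max(0, buf.length - 2048)).toString('latin1');
  const idx = tail.lastIndexOf('startxref');
  if (idx === -1) {
    return { ok: false, reason: 'no startxref keyword found' };
  }

  // The keyword should be followed by whitespace and a decimal byte offset.
  const m = /^startxref\s+(\d+)/.exec(tail.slice(idx));
  if (!m) {
    return { ok: false, reason: 'startxref has no numeric offset' };
  }

  const offset = Number(m[1]);
  if (offset >= buf.length) {
    return { ok: false, offset, reason: 'offset is past end of file' };
  }

  // Look at the bytes the offset actually points to.
  const target = buf.subarray(offset, offset + 32).toString('latin1');
  if (target.startsWith('xref')) {
    return { ok: true, offset, kind: 'table' };
  }
  // "N G obj" here may be a cross-reference stream; this sketch does not confirm that.
  if (/^\d+\s+\d+\s+obj/.test(target)) {
    return { ok: true, offset, kind: 'object' };
  }
  return { ok: false, offset, reason: 'offset does not point at an xref section' };
}
```

Usage would be something like checkStartXref(require('fs').readFileSync(pdfPath)), with pdfPath being the same path passed to createReader. An ok: false result suggests the file needs repair before Hummus can parse it.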

PDFWriter / Hummus assumes a valid PDF file structure, while some desktop apps like Preview or Acrobat Reader are more willing to digest poorly formed PDFs.

For repairing invalid PDFs (and/or diagnosing what's broken), there's a useful online tool at https://www.pdf-online.com/osa/repair.aspx. After fixing the PDF there, Hummus was able to read it just fine.