boazsegev / combine_pdf

A pure Ruby library to merge PDF files, number pages and maybe more...
MIT License

Unknown PDF parsing error - maleformed PDF file? #49

Closed wingleungchoi closed 8 years ago

wingleungchoi commented 8 years ago

I tried the following code:

require 'combine_pdf'

# pdf_data holds the raw PDF bytes as a String
combine_pdf = CombinePDF.new
combine_pdf << CombinePDF.parse(pdf_data)

and got the following error:

Warning: parser advnacing for unknown reason. Potential data-loss.
Warning: parser advnacing for unknown reason. Potential data-loss.
RuntimeError: Unknown PDF parsing error - maleformed PDF file?

Version:

combine_pdf 0.2.14

I suspect the pdf_data is either malformed or not supported yet. It begins like this:

"%PDF-1.4\n1 0 obj\n<<\n/Title (\xFE\xFFbSSpg\rR\xA1)\n/Producer (wkhtmltopdf)\n/CreationDate (D:20160224104241)\n>>\nendobj\n4 0 obj\n<<\n/Type /ExtGState\n/SA true\n/SM 0.02\n/ca 1.0\n/CA 1.0\n/AIS false\n/SMask /None>>\nendobj\n5 0 obj\n[/Pattern /DeviceRGB]\nendobj\n7 0 obj\n<<\n/Type /XObject\n/Subtype /Image\n/Width 14\n/Height 99\n/BitsPerComponent 8\n/ColorSpace /DeviceRGB\n/Length 8 0 R\n/Filter /DCTDecode\n>>\nstream\n\xFF\xD8\xFF\xE0\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000`\u0000`\u0000\u0000\xFF\xDB\u0000C\u0000\u0002\u0001\u0001\u0002\u0001\u0001\u0002\u0002\u0002\u0002\u0002\u0002\u0002\u0002\u0003\u0005\u0003\u0003\u0003\u0003\u0003\u0006\u0004\u0004\u0003\u0005\a\u0006\a\a\a\u0006\a\a\b\t\v\t\b\b\n\b\a\a\n\r\n\n\v\f\f\f\f\a\t\u000E\u000F\r\f\u000E\v\f\f\f\xFF\xDB\u0000C\u0001\  (to be continued)"
wingleungchoi commented 8 years ago

@boazsegev I sent you an email with the PDF file. Could you please have a look at it?

boazsegev commented 8 years ago

I'm looking into this, I'll let you know what I find.

boazsegev commented 8 years ago

I found that the end of the file contains HTML text that isn't PDF related.

Right after the last %%EOF marker is:

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>

</title></head>
<body>
    <form name="form1" method="post" action="PrintPDF.aspx?OrderNo=R800001602240028%2c&amp;type=A4" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTQyNTgzNjAyNw9kFgICAw9kFgICAQ8WAh4JaW5uZXJodG1sBVxodHRwOi8vd3d3LnBmY2V4cHJlc3MuY29tLy9NYW5hZ2UvVXBGaWxlL1BpbnRMYWJlbC8vNjI0YWVhMmYtOTBmNy00YjdhLWJhNWMtNzRmZTA5MDY2NGI0LlpJUGRkU8G4qi0vhAYUAho7ybOeQS4PlTSmnWyqAfjeIFcilcs=" />
</div>

    <div id="div">http://www.pfcexpress.com//Manage/UpFile/PintLabel//624aea2f-90f7-4b7a-ba5c-74fe090664b4.ZIP</div>
    </form>
</body>
</html>

I'm not sure whether this is a valid PDF, although other readers ignore the error... I'll look into the standard to see if this is valid, and I'll search for a good way to work around it. That said, this could be easier to solve on your side: it could be that you're sending HTML data right along with the file.
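For anyone who wants to check their own file for this kind of trailing data, here is a minimal sketch. It assumes the raw file contents are available as a String; `trailing_data_after_eof` is a hypothetical helper written for this illustration, not part of CombinePDF.

```ruby
# Returns whatever follows the last %%EOF marker, with surrounding whitespace removed.
# Returns nil when the data contains no %%EOF marker at all.
# NOTE: hypothetical helper for illustration; CombinePDF does not provide it.
def trailing_data_after_eof(data)
  bytes = data.b                       # binary copy, so offsets are byte offsets
  marker = "%%EOF"
  last = bytes.rindex(marker)
  return nil if last.nil?

  start = last + marker.bytesize
  bytes.byteslice(start, bytes.bytesize - start).strip
end

# Simulated input: a PDF-like string followed by stray HTML.
sample = "%PDF-1.4\n...\n%%EOF\n<!DOCTYPE html>\n<html></html>\n"
puts trailing_data_after_eof(sample)
```

An empty result means nothing but whitespace follows the last marker; anything else is worth inspecting before handing the data to a parser.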

boazsegev commented 8 years ago

Can you let me know whether it's your application that generates the PDF, and whether you can remove the HTML from the end of the PDF?

In the PDF format, multiple %%EOF markers might exist, so CombinePDF is "correctly" attempting to parse the data after the last %%EOF... If I change CombinePDF to fail silently (like some readers), some unexpected results might go unnoticed with no exception raised.
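To illustrate the point about multiple markers: a PDF that has been updated incrementally has new sections appended to it, each ending in its own %%EOF. The sketch below only locates the markers in a simulated string. `eof_offsets` is a hypothetical helper, and this is not how CombinePDF's parser works internally.

```ruby
# Lists the byte offset of every %%EOF marker found in the data.
# NOTE: hypothetical helper for illustration only.
def eof_offsets(data)
  bytes = data.b                       # binary copy, so offsets are byte offsets
  marker = "%%EOF"
  offsets = []
  pos = 0
  while (found = bytes.index(marker, pos))
    offsets << found
    pos = found + marker.bytesize      # continue searching after this marker
  end
  offsets
end

# Simulated file: an original body plus one appended update section.
sample = "%PDF-1.4\nbody\n%%EOF\nupdate\n%%EOF\n"
p eof_offsets(sample)
```

More than one offset is not an error in itself, which is why a parser cannot simply stop at the first marker it sees.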

wingleungchoi commented 8 years ago

@boazsegev thank you so much for the quick responses. I downloaded the PDF from a third-party website. When I open the PDF file in Sublime and remove the HTML part, the image in the PDF changes.
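One possible explanation for the changed image (an assumption, not something confirmed in this thread) is that saving a PDF from a text editor re-encodes the binary image stream. A byte-safe alternative is to cut the data after the last %%EOF in Ruby. This is a minimal sketch; `strip_after_last_eof` is a hypothetical helper and the file names in the usage comment are placeholders.

```ruby
# Drops everything after the last %%EOF marker without re-encoding any bytes.
# Returns the data unchanged (as a binary string) when no marker is found.
# NOTE: hypothetical helper for illustration; not a CombinePDF method.
def strip_after_last_eof(data)
  bytes = data.b                       # work on raw bytes only
  marker = "%%EOF"
  last = bytes.rindex(marker)
  return bytes if last.nil?

  bytes.byteslice(0, last + marker.bytesize) + "\n"
end

# Usage (file names are placeholders):
#   raw   = File.binread("label.pdf")            # binary read, no encoding conversion
#   clean = strip_after_last_eof(raw)
#   File.binwrite("label_clean.pdf", clean)      # binary write
#   pdf   = CombinePDF.parse(clean)
```

`File.binread` and `File.binwrite` keep the bytes exactly as they are on disk, so embedded JPEG streams are not touched.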

boazsegev commented 8 years ago

Can I ask which site generated the file? I think it might be better to fix the issue at the site than to have a developer library ignore PDF errors that might be critical in some cases...

...I'm still debating this, because this is a question of design rather than an error.

boazsegev commented 8 years ago

I released a new version with a fix.

I tested this on a bunch of PDFs and it doesn't affect valid PDF parsing... so I guess the compatibility fix should be okay.

I hope this works for you - please let me know.

Good luck!

wingleungchoi commented 8 years ago

@boazsegev thank you so much for the new version. It works perfectly. For privacy reasons, I will email you about the website.

boazsegev commented 8 years ago

Thanks :-) I'm happy it's working :+1: