boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

Unknown PDF parsing error - maleformed PDF file? #58

Closed wingleungchoi closed 8 years ago

wingleungchoi commented 8 years ago

When I tried to combine PDF files, I got the following error. I will send you an email with the PDF file.

Warning: parser advnacing for unknown reason. Potential data-loss.
RuntimeError: Unknown PDF parsing error - maleformed PDF file?
from /Users/WLCHOI/.rvm/gems/ruby-2.2.3/gems/combine_pdf-0.2.16/lib/combine_pdf/parser.rb:80:in `parse'

Do you have any idea what is behind it?

boazsegev commented 8 years ago

Hi Wing Leung,

Thank you for reporting this issue and sending me a file to test with. 👍

The Warning: parser advnacing for unknown reason. Potential data-loss. isn't an error - but it indicates that there is a problem in the PDF file.

I don't know what the error is, but I will try to look into it soon. I'm rather busy, so it might take a while.

I noticed this PDF file was created with a very old PDF library (iText 2.1.0 was new in 2008). I'm not sure this is related, but I will look.

Again, thank you very much. I will keep you updated. Bo.

boazsegev commented 8 years ago

Hi Wing Leung,

I looked just a little into the file and noticed that it is corrupted.

The PDF file starts with an HTTP header:

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Accept-Ranges: bytes
ETag: W/"284992-1461054274000"
Last-Modified: Tue, 19 Apr 2016 08:24:34 GMT
Content-Type: application/pdf
Content-Length: 284992
Date: Tue, 19 Apr 2016 08:24:34 GMT

I'm guessing that most PDF readers ignore the corrupted data, but CombinePDF doesn't ignore the issue and it's letting you know that the PDF is malformed.

This is actually an issue with the server sending (or saving) the file, not with CombinePDF.

Maybe later I will try to write a workaround and see what I can do, but I have to run for now (I have a test at school today).

Thanks, Bo.
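A minimal sketch of the kind of workaround mentioned above, assuming the only damage is extra bytes (such as the HTTP header) saved in front of the real PDF data. The helper name strip_leading_junk is hypothetical and is not part of CombinePDF; it uses only the Ruby standard library.

```ruby
# Hypothetical helper (not part of CombinePDF): drop any bytes that precede
# the "%PDF-" marker, such as an HTTP response header saved into the file.
def strip_leading_junk(data)
  bytes = data.b                      # binary copy, so byte offsets are exact
  start = bytes.index('%PDF-')
  raise ArgumentError, 'no %PDF- marker found' if start.nil?

  bytes[start..-1]
end

# Example input: a fake response header followed by the start of a PDF.
raw = "HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4\n1 0 obj\n"
clean = strip_leading_junk(raw)
puts clean.start_with?('%PDF-')
```

The cleaned string could then be handed to CombinePDF.parse. This is only a guess at a fix: if the file is damaged in other ways, parsing may still fail.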

wingleungchoi commented 8 years ago

Hi Bo,

Thank you for your help. I will report the stray HTTP header in the PDF file to the label creator. Good luck on your test ;-)

Many Thanks, WingLeung

wingleungchoi commented 8 years ago

Hi @boazsegev ,

I also found another PDF file that triggers the same error.

Warning: parser advnacing for unknown reason. Potential data-loss.
RuntimeError: Unknown PDF parsing error - maleformed PDF file?
from /Users/WLCHOI/.rvm/gems/ruby-2.2.3/gems/combine_pdf-0.2.16/lib/combine_pdf/parser.rb:80:in `parse'

Just sent an email with it. Many Thanks, 🙏 WingLeung

boazsegev commented 8 years ago

Hi WingLeung,

Thank you for sending me the second PDF file.

I can't seem to replicate the error for the second file, on my system.

I noticed you are using version 0.2.16.

Can you try opening the second file using the latest version, 0.2.21?

wingleungchoi commented 8 years ago

Hi Bo,

I tried the latest version but hit a similar error:

pdf_url = "pdf_url"
pdf_data = Base64.encode64(open(pdf_url).read).force_encoding('UTF-8')
CombinePDF.parse(Base64.decode64(pdf_data).force_encoding('UTF-8'))
#===========
Warning: parser advnacing for unknown reason. Potential data-loss.
Warning: parser advnacing for unknown reason. Potential data-loss.
Warning: parser advnacing for unknown reason. Potential data-loss.
Warning: parser advnacing for unknown reason. Potential data-loss.
Warning: parser advnacing for unknown reason. Potential data-loss.
PDF is Encrypted! Attempting to decrypt - not yet fully supported.
Data raising exception:
 {:StrF=>:StdCF
 :CF=>{:StdCF=>{:CFM=>:AESV2
 :AuthEvent=>:DocOpen
 :Length=>16}}
 :StmF=>:StdCF
 :U=>"\n\xBF\x1F\xCF\x9DU\xC0B\xB7dS\x84\x80\xB0\xD4\x9B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
 :Length=>128
 :V=>4
 :O=>"\xE0\xE6\xD0M[v\\b\v\x9F\x8A\xC8\xB9\x0Fe^\xAAC\xDD\xD5\xA93\xA4\xF3\xD3LL]W\xC7\x1F\xEC"
 :P=>-1836
 :Filter=>:Standard
 :R=>4
 :indirect_generation_number=>0
 :indirect_reference_id=>26}
Data raising exception:
 {:StrF=>:StdCF
 :CF=>{:StdCF=>{:CFM=>:AESV2
 :AuthEvent=>:DocOpen
 :Length=>16}}
 :StmF=>:StdCF
 :U=>"\n\xBF\x1F\xCF\x9DU\xC0B\xB7dS\x84\x80\xB0\xD4\x9B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
 :Length=>128
 :V=>4
 :O=>"\xE0\xE6\xD0M[v\\b\v\x9F\x8A\xC8\xB9\x0Fe^\xAAC\xDD\xD5\xA93\xA4\xF3\xD3LL]W\xC7\x1F\xEC"
 :P=>-1836
 :Filter=>:Standard
 :R=>4
 :indirect_generation_number=>0
 :indirect_reference_id=>26}
RuntimeError: File is encrypted - not supported.
from /Users/WLCHOI/.rvm/gems/ruby-2.2.3/gems/combine_pdf-0.2.21/lib/combine_pdf/decrypt.rb:178:in `raise_encrypted_error'

boazsegev commented 8 years ago

Are you sure you sent me the correct file? I tried the code and I didn't get any errors...

This was the code I tried in the terminal:

require 'base64'
pdf_url = File.expand_path "~/Desktop/ESHK8977968_label.pdf"
pdf_data = Base64.encode64(open(pdf_url).read).force_encoding('UTF-8'); nil
pdf = CombinePDF.parse(Base64.decode64(pdf_data).force_encoding('UTF-8')); nil
pdf.save "1.pdf"

wingleungchoi commented 8 years ago

You are right. The PDF file itself works perfectly with the gem. I think the problem might be that the PDF URL opens a pop-up window for printing. Let me send you the URL.

boazsegev commented 8 years ago

This isn't an error... the PDF at the URL is encrypted with AESv2, which the library doesn't support.

PDF 1.5 files support custom encryption algorithms, which is (I am sad to say) something I don't know much about.

I wrote very basic encryption/decryption support for the library. At some point I started writing the more complicated AES decryption support...

... but AES was more complicated than I expected, and all the encrypted files I found for testing used RC4 encryption, so I couldn't test... so I just gave up.

If you know someone who can extend the decryption support, that would be great, but for now, I am sad to say that this PDF isn't supported by the library because of its encryption.
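To make the RC4 versus AES distinction concrete, here is a rough sketch that names the cipher implied by an encryption dictionary shaped like the dump printed earlier in this thread. The helper encryption_cipher is hypothetical and is not CombinePDF's API. The mapping follows the PDF specification's /V and /CFM values as I understand them (V 1 or 2 means RC4; V 4 and later use crypt filters, where V2 is RC4 and AESV2 is 128-bit AES).

```ruby
# Hypothetical helper (not part of CombinePDF): name the cipher implied by a
# PDF /Encrypt dictionary, given as a Ruby Hash shaped like the dump above.
def encryption_cipher(dict)
  case dict[:V]
  when 1, 2
    :rc4
  when 4, 5
    # V 4 and 5 delegate to crypt filters; look up the one used for streams.
    filter = dict[:StmF] || :Identity
    return :none if filter == :Identity

    cfm = dict.fetch(:CF, {}).fetch(filter, {})[:CFM]
    { V2: :rc4, AESV2: :aes_128, AESV3: :aes_256 }.fetch(cfm, :unknown)
  else
    :unknown
  end
end

# Values copied from the dump in this thread (the :U, :O and :P keys are omitted).
encrypt = {
  Filter: :Standard, V: 4, R: 4, Length: 128,
  StmF: :StdCF, StrF: :StdCF,
  CF: { StdCF: { CFM: :AESV2, AuthEvent: :DocOpen, Length: 16 } }
}
puts encryption_cipher(encrypt) # prints "aes_128"
```

A check like this only identifies the cipher; it does not decrypt anything.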

wingleungchoi commented 8 years ago

Interesting 👍, I will look for more details about AESv2 encryption.

wingleungchoi commented 8 years ago

@boazsegev I contacted the PDF author. He said the files are encrypted because of security concerns: he wants to ensure that no one plays with the PDF files. In that case, should we leave them undecrypted? I think this issue should be closed.

boazsegev commented 8 years ago

I'm closing the issue, as it's inactive.

jordan-allan commented 8 years ago

I am receiving a similar error when combining a PDF of version 1.2; the same PDF of version 1.3 works correctly. I have emailed you supporting documents.

Warning: parser advnacing for unknown reason. Potential data-loss.

RuntimeError (Unknown PDF parsing error - maleformed PDF file?):
lib/merge_pdf.rb:26:in `block in combine'
lib/merge_pdf.rb:25:in `each'
lib/merge_pdf.rb:25:in `combine'
lib/merge_pdf.rb:18:in `call'

boazsegev commented 8 years ago

Hi Jordan,

Thanks for opening this issue.

After opening the v.1.2 file you sent me (in my code editor), I noticed it has an invalid PDF header...

Each PDF starts with a comment line indicating its version. Comment lines start with a %

The v.1.2 file shows a valid version indicator: %PDF-1.2

Next should come a comment line indicating whether the PDF should be read as a binary or a text file. It also starts with a % and, for binary files, contains non-text data.

The v.1.2 file shows a valid text PDF file indicator (non-binary; such a file shouldn't be sent by email, as mail servers might add line breaks): %dhi9hklfrp25

Next should come the PDF data, starting with a PDF object in the format: ### ### obj (e.g. 10 0 obj) ... However, the v.1.2 file continues with a stream of binary data that is not PDF compliant: ??[DLE][NUL][EOT][...]3?[...]

In theory, a patch could be written to attempt to ignore unknown data... however, as a design decision, I have tried to prefer raising exceptions over quiet failures. I believe real potential for data loss should cause the parser to fail rather than produce invalid PDF data.

I'm sorry, but I doubt I can help with this issue. If I did, it might require re-writing the PDF parser.

Again, I thank you for opening this issue and I wish you good luck!
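The manual inspection described above can be approximated with a short pre-check. This is a rough sketch, not how CombinePDF's parser works: the helper inspect_pdf_start is hypothetical, it looks only at the first 1024 bytes (an arbitrary limit), and it assumes the first non-comment line should be an object header such as 10 0 obj.

```ruby
# Hypothetical helper (not part of CombinePDF): report the declared PDF version
# and whether the first non-comment line looks like an "N G obj" header.
def inspect_pdf_start(data)
  head = data.b[0, 1024].to_s                      # only inspect the beginning
  lines = head.split(/\r\n|\r|\n/).map(&:strip)    # PDFs may use any line ending
  version = lines.first.to_s[/\A%PDF-(\d\.\d)/, 1]
  first_body = lines.drop(1).find { |l| !l.empty? && !l.start_with?('%') }
  {
    version: version,
    starts_with_object: first_body.to_s.match?(/\A\d+\s+\d+\s+obj\b/)
  }
end

good = "%PDF-1.2\n%dhi9hklfrp25\n10 0 obj\n<< >>\nendobj\n"
bad  = "%PDF-1.2\n%dhi9hklfrp25\n\x10\x00\x04garbage\n"
p inspect_pdf_start(good)
p inspect_pdf_start(bad)
```

For the second input, the version is still detected but starts_with_object is false, which matches the situation described for the v.1.2 file.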