anicholson / origami-pdf

Automatically exported from code.google.com/p/origami-pdf
GNU Lesser General Public License v3.0

Created ruby script from PDF, script produces error #24

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Here is the full "test script" using the latest fetch:
--------------------

hg clone https://code.google.com/p/origami-pdf/
cd origami-pdf/
rake
cd ..
curl 'http://www.ada.gov/hospcombrprt.pdf' -o hospcombrprt.pdf
origami-pdf/bin/pdf2ruby -x hospcombrprt.pdf
mv hospcombrprt.pdf hospcombrprtORIG.pdf
cd hospcombrprt
ruby hospcombrprt.rb # THIS STEP PRODUCES ERRORS
cmp hospcombrprt.pdf ../hospcombrprtORIG.pdf || echo FAILED

-----------------------------

EXPECTED: 
Two files are identical

ACTUAL:
/Users/williamentriken/Developer/origami-pdf/lib/origami/page.rb:75:in `pages': Invalid page tree (Origami::InvalidPDFError)
    from /Users/williamentriken/Developer/origami-pdf/lib/origami/pdf.rb:689:in `compile'
    from /Users/williamentriken/Developer/origami-pdf/lib/origami/pdf.rb:233:in `save'
    from hospcombrprt.rb:189:in `<main>'

Original issue reported on code.google.com by fulldec...@gmail.com on 28 Jun 2014 at 3:59

GoogleCodeExporter commented 9 years ago
Confirming that I still have this issue on the latest version.

Original comment by fulldec...@gmail.com on 18 Nov 2014 at 10:14

GoogleCodeExporter commented 9 years ago
It might not be an error, but rather an unsupported PDF version.
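
One quick way to see which version a file declares is to read its header. This is a minimal sketch, not part of Origami: pdf_header_version is a hypothetical helper name, and it only looks at the %PDF-x.y header line, which a /Version entry in the document catalog may override.

```ruby
# Hypothetical helper (not an Origami API): return the version string
# declared in the file header, or nil when no header is found.
def pdf_header_version(path)
  # The header is expected within the first 1024 bytes of the file.
  head = File.open(path, 'rb') { |f| f.read(1024) } || ''
  head[/%PDF-(\d+\.\d+)/, 1]
end
```

For example, pdf_header_version('hospcombrprt.pdf') would return a string such as "1.6" if that is what the file declares.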

The http://www.ada.gov/hospcombrprt.pdf file is encrypted with type 4 
encryption, which the PDF standard defines, starting with PDF 1.5, as:

"(PDF 1.5) The security handler defines the use of encryption and decryption in 
the document, using the rules specified by the CF, StmF, and StrF entries."

The encryption uses AESV2, which requires PDF 1.6 or above:

"AESV2 (PDF 1.6) The application shall ask the security handler for the 
encryption key and shall implicitly decrypt data with "Algorithm 1: Encryption 
of data using the RC4 or AES algorithms", using the AES algorithm in Cipher 
Block Chaining (CBC) mode with a 16-byte block size and an initialization 
vector that shall be randomly generated and placed as the first 16 bytes in the 
stream or string."

So even if the decryption code is implemented, it might not be clear how to 
apply it, because of the way the PDF file is structured.
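
The AESV2 rule quoted above (AES in CBC mode, with the IV stored as the first 16 bytes) can be sketched with Ruby's standard OpenSSL bindings. This is only an illustration under stated assumptions: aesv2_decrypt and aesv2_encrypt are hypothetical helper names, not Origami's API, and the key is assumed to be the 16-byte per-object key that the security handler has already derived.

```ruby
require 'openssl'

# Hypothetical helper: decrypt one AESV2-encrypted string or stream body.
# Assumes `key` is the already-derived 16-byte key for this object.
def aesv2_decrypt(key, data)
  iv         = data.byteslice(0, 16)                   # first 16 bytes are the IV
  ciphertext = data.byteslice(16, data.bytesize - 16)  # the rest is AES-CBC data
  cipher = OpenSSL::Cipher.new('aes-128-cbc')
  cipher.decrypt
  cipher.key = key
  cipher.iv  = iv
  cipher.update(ciphertext) + cipher.final             # final also strips the padding
end

# Hypothetical counterpart, used here only to build sample input:
# a random IV is generated and placed in front of the ciphertext.
def aesv2_encrypt(key, plaintext)
  cipher = OpenSSL::Cipher.new('aes-128-cbc')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv
  iv + cipher.update(plaintext) + cipher.final
end

key  = '0123456789abcdef'                 # made-up 16-byte key, for illustration
blob = aesv2_encrypt(key, 'hello page tree')
puts aesv2_decrypt(key, blob)
```

The sketch leaves out the part this comment is unsure about: finding which strings and streams the CF, StmF, and StrF entries apply to, and deriving the key for each object.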

Original comment by bse...@gmail.com on 16 Dec 2014 at 12:07

GoogleCodeExporter commented 9 years ago
Thank you for the feedback. Here is an updated test case with a different kind 
of file:

hg clone https://code.google.com/p/origami-pdf/
cd origami-pdf/
rake
cd ..
curl 'http://www.irs.gov/pub/irs-pdf/p1.pdf' -o p1.pdf
origami-pdf/bin/pdf2ruby -x p1.pdf
cd p1/
ruby p1.rb 

Here is the new error:

/Users/williamentriken/Developer/origami-pdf/lib/origami/page.rb:75:in `pages': Invalid page tree (Origami::InvalidPDFError)
    from /Users/williamentriken/Developer/origami-pdf/lib/origami/pdf.rb:689:in `compile'
    from /Users/williamentriken/Developer/origami-pdf/lib/origami/pdf.rb:233:in `save'
    from ./p1.rb:40:in `<main>'

Original comment by fulldec...@gmail.com on 16 Dec 2014 at 8:21

GoogleCodeExporter commented 9 years ago
Interesting...

The only thing I can see is that both files contain PDF object streams, 
introduced in PDF version 1.5.

Also, the object streams contain objects that aren't dictionaries (less 
common)... I just fixed a similar issue in the combine_pdf gem ( 
https://github.com/boazsegev/combine_pdf ).

Maybe it's a similar issue (expecting PDF dictionary objects in the object 
streams, and getting a different kind of object instead)...?
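
To illustrate the guess above, assuming the streams in question are the object streams (/Type /ObjStm) that PDF 1.5 introduced: the decoded data starts with N pairs of "object number, offset", followed by the objects themselves, and nothing requires those objects to be dictionaries. The sketch below uses a hypothetical helper, split_object_stream, which is not taken from Origami or combine_pdf, and it expects data that has already been decompressed.

```ruby
# Hypothetical helper: split the decoded data of an object stream into
# { object_number => raw_object_text }. `n` and `first` stand for the
# /N and /First entries of the stream dictionary.
def split_object_stream(decoded, n, first)
  # The header is N pairs of integers: object number, offset from `first`.
  header = decoded[0, first].split.map { |t| Integer(t) }
  pairs  = header.first(2 * n).each_slice(2).to_a
  pairs.each_with_index.map do |(num, offset), i|
    # Each object runs up to the next object's offset, or to the end of the data.
    stop = i + 1 < pairs.size ? pairs[i + 1][1] : decoded.size - first
    [num, decoded[first + offset, stop - offset].strip]
  end.to_h
end

header = "11 0 12 18 13 26 "
body   = "<< /Type /Page >> [1 2 3] 42"
objs   = split_object_stream(header + body, 3, header.length)
p objs  # => {11=>"<< /Type /Page >>", 12=>"[1 2 3]", 13=>"42"}
```

Only object 11 in this made-up example is a dictionary. Code that assumes every entry is one would fail on objects 12 and 13.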

Original comment by bse...@gmail.com on 20 Dec 2014 at 8:33