boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License
735 stars 157 forks source link

Removes stream_length as it returns a hash instead of an integer and … #61

Closed Reyko closed 8 years ago

Reyko commented 8 years ago

I m not sure what is the difference between stream_length and actual_length, but at the moment stream_length returns a hash:

object[:Length] => {:is_reference_only=>true, :indirect_generation_number=>0, :indirect_reference_id=>6}

This causes an exception as it is doing the comparison:

if actual_length < stream_length

The result of this is to get rescued and return

File is encrypted - not supported.

boazsegev commented 8 years ago

Hi @Reyko and thank you for this pull request.

Reading through this code, it's clear there are some things we should consider fixing, but I doubt if this is the best way to go about them.

Allow me to explain:

  1. object[:raw_stream_content].length was a mistake in the original code, we should be looking at bytesize rather then length.
  2. PDF files contain a dictionary entry with the number of bytes that should be present in the stream (which means that the stream might contain a a different number of bytes, the tail being ignored, and that would indicate a possible error)...

    stream_length should return the registered length (the dictionary entry) in bytes (Fixnum) and actual_length should indicate the actual bytesize of the stream. It's used by the gem to warm about possible errors.

  3. PDF data might be stored as a reference rather then a value, this is where actual_object and actual_value come in (one is used when looking for a referenced dictionary and the other when looking for a referenced value).

    actual_object resolves pointers and returns the actual container of the data... which is where my code when wrong. actual_value is what I wanted, it resolves pointers to the value rather then the value container

I'm not sure if removing the error check is the best fix.

I think a better approach might be changing this:

stream_length = actual_object(object[:Length])
actual_length = object[:raw_stream_content].length

to:

stream_length = actual_value(object[:Length])
actual_length = object[:raw_stream_content].bytesize
if !stream_length # it's a required entry, but it might be missing
    warn "PDF error, required stream length data is missing. Attempting to fix."
    stream_length ||= actual_length 
end

If you agree, could you update the pull request so that I can merge the fix?

Thanks again for pointing out this issue! Bo.

Reyko commented 8 years ago

Hey @boazsegev thanks for getting back to me :)

I understand your suggestion and I tried implementing it but actual_value still returns a hash and not a Fixnum.

Maybe is it something wrong with the actual_value?

I have added a test which can demonstrate the problem if you run it with the stream_length.

Thanks again

boazsegev commented 8 years ago

This could be specific to the PDF file you're working on - any chance I could see it?

Reyko commented 8 years ago

It is inside the repo in the fixtures encrypted.pdf

boazsegev commented 8 years ago

Using the file, I found the source of the issue.

The Decrypt module is called before the PDF objects are all linked together (before reference mapping). At that point, only the root object was mapped, meaning the actual_value and actual_object methods cannot perform their jobs properly for any other object.

Also, there is the issue that the length of the stream is stored as a reference which the Decrypt module cannot find (since mapping has yet to occur).

I can't move the mapping process before the decryption is complete, since object streams might contain more objects and mapping the references before the decryption might cause errors during the reference object linkage.

This might take me some time, but I'm working on a solution.

One option is to scan specifically for any missing objects, so that instead of linking the objects we are just locating the few objects we need for the decryption.

Reyko commented 8 years ago

Nice one @boazsegev, I m looking at it as well to see if I can help.

Cheers

boazsegev commented 8 years ago

I think the new version fixes the issue - could you check that it works correctly for you?

If it's working correctly, would it be okay to close the pull request?

Reyko commented 8 years ago

Hey,

Unfortunately, I m on my annual leave and I have no access to a computer. However, it is ok to close the pull request, I m gonna check if it works once I m back and if it doesn't I could email you if that's ok.

Thanks a lot for fixing this!!!

boazsegev commented 8 years ago

Have fun on your vacation :-)