No Method Error with Problem PDF After Passing Validation

gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby

https://hexapdf.gettalong.org


No Method Error with Problem PDF After Passing Validation #274

Closed CharlieWWW94 closed 7 months ago

CharlieWWW94 commented 9 months ago

Hi,

I recently ran into an issue with a problem PDF (which I cannot share due to GDPR regulations) that causes a NoMethodError when writing a HexaPDF::Document, even though the document passes validation.

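require 'hexapdf'
require 'stringio'

# Appends every page of the PDF given as a string to the target document.
class Merge
  def call(pdf:, string_to_merge:)
    io_stream = StringIO.new(string_to_merge)
    pdf_to_merge = HexaPDF::Document.new(io: io_stream)
    pdf_to_merge.pages.each { |page| pdf.pages << pdf.import(page) }
    pdf
  end
end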

document = HexaPDF::Document.new
merge = Merge.new
merge.call(pdf: document, string_to_merge: file.read)
if document.validate # validation passes for problem pdf
  io_stream = StringIO.new
  document.write(io_stream, optimize: false)
end

For other documents, validation fails as expected. For the problem PDF, however, document.validate returns true, and writing the document then raises the following error:

NoMethodError: undefined method `reject!' for #<HexaPDF::Dictionary [2, 0] value={:ProcSet=>[:PDF, :Text, :ImageB, :ImageC, :ImageI]}>

The stack trace directs to: hexapdf-0.24.2/lib/hexapdf/type/resources.rb:226:in `perform_validation'

Apologies for being unable to share the document in question; I appreciate that this makes replicating the issue and finding a solution difficult.

gettalong commented 9 months ago

Thanks for reporting!

It is indeed more difficult to debug when not having the PDF ;-) Could you please do the following:

Run hexapdf info --check problem.pdf and provide the output before the line with "File name:..." if there is any.
Run hexapdf inspect problem.pdf pages. This will give you a list of pages and object identifiers (x,y). The page identifier is the one before the colon.
For each page identifier run hexapdf inspect problem.pdf x,y. This will output the page object for the respective page. This information should not contain any GDPR relevant information. If you see something, just replace it with something else.

Thanks for helping!

CharlieWWW94 commented 9 months ago

Thanks for getting back to me about this so quickly!

Ran hexapdf info --check problem.pdf - the first line of the output was the File name.

Here is the output from inspecting the pages:

<<
  /Type /Page
  /Parent 66 0 R
  /Resources <<
    /ProcSet 65 0 R
    /Font <<
      /F1 34 0 R
      /F2 37 0 R
      /F3 40 0 R
      /F4 42 0 R
      /F5 61 0 R
      /F6 64 0 R
    >>
    /XObject <<
      /Im1 6 0 R
      /Im2 8 0 R
      /Im3 10 0 R
      /Im4 12 0 R
    >>
  >>
  /MediaBox [0 -0.8 595.2 841 ]
  /Contents 24 0 R
>>

<<
  /Type /Page
  /Parent 66 0 R
  /Resources <<
    /ProcSet 65 0 R
    /Font <<
      /F1 34 0 R
      /F2 37 0 R
      /F5 61 0 R
      /F3 40 0 R
      /F6 64 0 R
    >>
    /XObject <<
      /Im1 6 0 R
    >>
  >>
  /MediaBox [0 -0.8 595.2 841 ]
  /Contents 26 0 R
>>

<<
  /Type /Page
  /Parent 66 0 R
  /Resources <<
    /ProcSet 65 0 R
    /Font <<
      /F1 34 0 R
      /F2 37 0 R
      /F5 61 0 R
      /F3 40 0 R
      /F6 64 0 R
    >>
    /XObject <<
      /Im1 6 0 R
      /Im5 18 0 R
    >>
  >>
  /MediaBox [0 -0.8 595.2 841 ]
  /Contents 28 0 R
>>

<<
  /Type /Page
  /Parent 66 0 R
  /Resources <<
    /ProcSet 65 0 R
    /Font <<
      /F1 34 0 R
      /F2 37 0 R
      /F5 61 0 R
      /F3 40 0 R
    >>
    /XObject <<
      /Im1 6 0 R
      /Im6 22 0 R
    >>
  >>
  /MediaBox [0 -0.8 595.2 841 ]
  /Contents 30 0 R
>>

gettalong commented 9 months ago

Thanks! Alas, that output looks fine. And the absence of any warning or error message from hexapdf info suggests that the problem PDF is most probably fine.

Does your script also fail with an error if you use document.write(io_stream, optimize: false, validate: false)?

And could you provide the output of hexapdf inspect problem.pdf X with X being each of the following: 6, 8, 10, 12, 18, 22, 65?

Thanks!

CharlieWWW94 commented 9 months ago

Sorry for the delay in getting back to you - I had to confirm the information I can send over.

The script does not fail when using document.write(io_stream, optimize: false, validate: false), all works fine.

Here is the output from running hexapdf inspect problem.pdf X - I have removed some garbled information from the output for GDPR, but have indicated where I have done so.

hexapdf inspect problem.pdf 6

<<
  /Type /XObject
  /Subtype /Image
  /Name /Im1
  /Width 227
  /Height 97
  /ColorSpace [/Indexed /DeviceRGB 255 **removed information**]
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 6886
>>

hexapdf inspect problem.pdf 8

<<
  /Type /XObject
  /Subtype /Image
  /Name /Im2
  /Width 133
  /Height 57
  /ColorSpace [/Indexed /DeviceRGB 255 **removed information**]
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 3157
>>

hexapdf inspect problem.pdf 10

<<
  /Type /XObject
  /Subtype /Image
  /Name /Im3
  /Width 257
  /Height 51
  /ColorSpace [/Indexed /DeviceRGB 255 **removed_information**]
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 2100
>>

hexapdf inspect problem.pdf 12

<<
  /Type /XObject
  /Subtype /Image
  /Name /Im4
  /Width 1440
  /Height 214
  /ColorSpace /DeviceRGB
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 6006
>>

hexapdf inspect problem.pdf 18

<<
  /Type /XObject
  /Subtype /Image
  /Name /Im5
  /Width 59
  /Height 59
  /ColorSpace /DeviceRGB
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 542
>>

hexapdf inspect problem.pdf 22

<<
  /Type /XObject
  /Subtype /Image
  /Name /Im6
  /Width 59
  /Height 54
  /ColorSpace [/Indexed /DeviceRGB 255 **removed information** ]
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 244
>>

hexapdf inspect problem.pdf 65

<<
  /ProcSet [/PDF /Text /ImageB /ImageC /ImageI ]
>>

Thanks again for the help!

gettalong commented 9 months ago

Thanks for the output! Oh, the infinite number of ways to create partly invalid PDFs :sweat_smile:

So, we have this for a page object:

<<
  /Type /Page
  /Parent 66 0 R
  /Resources <<
    /ProcSet 65 0 R
    /Font <<
      /F1 34 0 R
      /F2 37 0 R
      /F3 40 0 R
      /F4 42 0 R
      /F5 61 0 R
      /F6 64 0 R
    >>
    /XObject <<
      /Im1 6 0 R
      /Im2 8 0 R
      /Im3 10 0 R
      /Im4 12 0 R
    >>
  >>
  /MediaBox [0 -0.8 595.2 841 ]
  /Contents 24 0 R
>>

The << and >> delimiters denote a PDF dictionary, i.e. a hash, while /NAME represents a PDF name, similar to a Symbol in Ruby.

For the page object, there is a /Resources key that needs to hold a dictionary value. Check :heavy_check_mark:
The resources dictionary can have a /ProcSet key which in this case references another object 65 0 R. Check :heavy_check_mark:
Looking at the object with ID (65,0) we find:
```
<<
  /ProcSet [/PDF /Text /ImageB /ImageC /ImageI ]
>>
```
So that object is again a dictionary with a single /ProcSet key. Fail :x: because it should be an array, i.e. like the value of that /ProcSet key.

This is the reason for the error message.
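
To make the mismatch concrete, here is a minimal plain-Ruby sketch of the structure described above. It does not use HexaPDF: `Ref`, `deref`, and `procset_problem` are hypothetical names introduced only for illustration, and the object table holds just object (65,0) as it appears in the inspect output.

```ruby
# Hypothetical stand-in for an indirect reference such as "65 0 R".
Ref = Struct.new(:oid, :gen)

# Object (65,0) as reported by `hexapdf inspect problem.pdf 65`:
# a dictionary wrapping the array instead of the array itself.
OBJECTS = {
  [65, 0] => { ProcSet: [:PDF, :Text, :ImageB, :ImageC, :ImageI] }
}.freeze

# Follows a reference into the object table; direct values pass through.
def deref(value, objects)
  value.is_a?(Ref) ? objects.fetch([value.oid, value.gen]) : value
end

# Returns a message if /ProcSet does not resolve to an array, else nil.
def procset_problem(resources, objects)
  procset = deref(resources[:ProcSet], objects)
  return nil if procset.nil? || procset.is_a?(Array)

  "/ProcSet should be an Array but is a #{procset.class}"
end

# The page's /Resources entry, reduced to the relevant key.
resources = { ProcSet: Ref.new(65, 0) }
puts procset_problem(resources, OBJECTS)
```

With the object table above, the check reports that /ProcSet resolves to a Hash rather than an Array, which is the same shape of problem that later surfaces as the missing `reject!` method.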

However, what I'm not sure about is why the first call to document.validate doesn't fail, since there is a type check (/ProcSet should be an array) and that check should fail.

I will try to recreate the situation to find out the root cause.

Two more questions:

Is the problematic PDF one you got from an external source or is that already the result of a HexaPDF operation?
Could you run hexapdf inspect problem.pdf rev?

Thanks!

CharlieWWW94 commented 9 months ago

The problematic PDF comes from an external source, it is not the result of a HexaPDF operation.

Here is the output from hexapdf inspect problem.pdf rev

Document has 1 revision
Revision 1
  Type      : xref table
  Objects   : 67
  Size      : 68
  Byte range: 0-66996

Glad you've worked out why we are getting the error message!

gettalong commented 9 months ago

Thanks for the additional information!

Yes, but I will also need to find out why the first validation is returning true instead of false.

gettalong commented 7 months ago

@CharlieWWW94 Sorry for the wait.

I have fixed both problems:

The first problem is the thrown error. It was introduced by a change in how HexaPDF performs validation: it now tries to report as many validation problems as possible. So if one validation fails, the routine doesn't necessarily stop, provided later validations don't depend on the earlier ones. However, this only works for validations in the same class and not if the validation failed in a superclass.

To fix this HexaPDF will now catch errors thrown during validation and change them into validation errors. There might be a better fix in the future but for now that is enough.
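
The idea of turning unexpected exceptions into validation errors can be sketched in plain Ruby as follows. This is not HexaPDF's actual implementation: `SketchDictionary` and `SketchResources` are hypothetical classes, and the message format is an assumption made for the example.

```ruby
# Hypothetical dictionary wrapper; unlike Array it has no #reject!.
class SketchDictionary
  def initialize(value)
    @value = value
  end
end

# Hypothetical validated object (not a HexaPDF class).
class SketchResources
  def initialize(value)
    @value = value
  end

  # Yields a message for every problem found and returns true only if
  # there were none. Unexpected exceptions become validation problems.
  def validate
    valid = true
    begin
      perform_validation do |message|
        valid = false
        yield(message) if block_given?
      end
    rescue StandardError => e
      valid = false
      yield("Unexpected error: #{e.class}: #{e.message}") if block_given?
    end
    valid
  end

  private

  # Assumes /ProcSet is an array, mirroring the failing call in the thread.
  def perform_validation
    procset = @value[:ProcSet]
    return if procset.nil?

    removed = procset.reject! { |name| !name.is_a?(Symbol) }
    yield("/ProcSet contained non-name entries") if removed
  end
end

broken = SketchResources.new({ ProcSet: SketchDictionary.new({}) })
messages = []
puts broken.validate { |message| messages << message }
puts messages
```

In this sketch the NoMethodError never reaches the caller. `validate` returns false and the exception text is collected as an ordinary validation message, so a caller that checks the return value before writing would skip the write.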

The second problem is that the first validation doesn't return false. The reason it works the second time is that the first validation pass resolves the /Resources key and wraps the hash into the proper dictionary subclass of HexaPDF::Type::Resources. However, it does not validate that the /Resources are actually valid. When the second pass comes around, the /Resources are validated and the problem arises.

I have fixed this by re-ordering the validation to make sure that non-indirect values are turned into proper PDF objects.
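
The effect of the ordering can be illustrated with a small plain-Ruby model. It is only a sketch of the behaviour described above, under the assumption that nested values are validated only once they have been wrapped. `TypedResources`, `SketchPage`, and the `wrap_first` flag are hypothetical and do not correspond to HexaPDF's internals.

```ruby
# Hypothetical typed wrapper for a resources dictionary.
class TypedResources
  def initialize(hash)
    @hash = hash
  end

  # Valid when /ProcSet is absent or an array.
  def valid?
    procset = @hash[:ProcSet]
    procset.nil? || procset.is_a?(Array)
  end
end

# Hypothetical page object holding a direct (non-indirect) /Resources hash.
class SketchPage
  def initialize(hash, wrap_first:)
    @hash = hash
    @wrap_first = wrap_first
  end

  def validate
    wrap_direct_values if @wrap_first
    ok = nested_values_valid?
    wrap_direct_values unless @wrap_first
    ok
  end

  private

  # Replaces a plain Hash under :Resources with a typed wrapper, once.
  def wrap_direct_values
    raw = @hash[:Resources]
    @hash[:Resources] = TypedResources.new(raw) if raw.is_a?(Hash)
  end

  # Only already-wrapped values are validated; plain hashes are skipped.
  def nested_values_valid?
    @hash.each_value.all? { |v| !v.is_a?(TypedResources) || v.valid? }
  end
end

bad = -> { { Resources: { ProcSet: { ProcSet: [:PDF, :Text] } } } }

old_order = SketchPage.new(bad.call, wrap_first: false)
p old_order.validate # first pass: the plain hash is skipped
p old_order.validate # second pass: the wrapped value is checked

new_order = SketchPage.new(bad.call, wrap_first: true)
p new_order.validate # wrapping first exposes the problem immediately
```

With the old ordering the model reports the page as valid on the first call and invalid on the second, matching the behaviour seen in the report. Wrapping before validating makes the first call fail as well.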

CharlieWWW94 commented 7 months ago

Amazing - thanks for the update and for sorting the issue! Much appreciated 😄