gdelugre / origami

Origami is a pure Ruby library to parse, modify and generate PDF documents.
GNU Lesser General Public License v3.0

Invalid xref stream for lazy: true #79

Open fulf opened 3 years ago

fulf commented 3 years ago

Ruby: 2.5.1 Origami: 2.1.0

When trying to read some PDFs with lazy: true, the parser raises an exception and stops. The same PDFs are read without a problem with lazy: false, and no errors are reported.

Origami::PDF.read(pdf_content_stream, lazy: true, verbosity: Origami::Parser::VERBOSE_TRACE)
[info ] ...Reading header...
[error] Breaking on: "\xBF\xBD\xEF\xBF\xBD\x04|\r\xEF\xBF..." at offset 0x3445c
[error] Last exception: [Origami::InvalidObjectError] Object shall begin with '%d %d obj' statement
[debug] Skipping this indirect object.
[trace] Read Stream object, 33 0 R
Origami::Parser::ParsingError: Invalid xref stream
from /.rvm/gems/ruby-2.5.1/gems/origami-2.1.0/lib/origami/parsers/pdf/lazy.rb:159:in `parse_revision_from_xrefstm'

I've traced the error to the snippet below: parse_object fails on its first attempt, logging the two [error] lines, and then successfully returns an Origami::Stream object. Since a plain Origami::Stream is not an Origami::XRefStream, the is_a? check fails and the exception is raised. Interestingly, though, XRefStream < Stream.

# lib/origami/parsers/pdf/lazy.rb:157
def parse_revision_from_xrefstm(revision)
    xrefstm = parse_object
    raise ParsingError, "Invalid xref stream" unless xrefstm.is_a?(XRefStream)
    # ...
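
To illustrate why that check rejects a plain stream even though XRefStream < Stream, here is a minimal sketch using stand-in classes. The Sketch module and its two classes are made up for illustration and are not Origami's own; they only mirror the inheritance relationship.

```ruby
# Stand-in classes (NOT Origami's) mirroring the hierarchy XRefStream < Stream.
module Sketch
  class Stream; end
  class XRefStream < Stream; end
end

plain = Sketch::Stream.new
xref  = Sketch::XRefStream.new

# The subclass relation holds at the class level.
puts(Sketch::XRefStream < Sketch::Stream)   # prints "true"

# An instance of the subclass is also an instance of the parent class...
puts(xref.is_a?(Sketch::Stream))            # prints "true"

# ...but an instance of the parent class is not an instance of the subclass,
# which is the situation the guard in parse_revision_from_xrefstm rejects.
puts(plain.is_a?(Sketch::XRefStream))       # prints "false"
```

So is_a?(XRefStream) only passes when the parsed object was built as an XRefStream (or a subclass of it); a generic Stream returned by parse_object will not pass it.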

I don't know much about PDF files, so I can't tell whether this is working as intended. In any case, what would be the proper way to read such a file? Is there anything better than the fallback below?

begin
  Origami::PDF.read(pdf_content_stream, lazy: true)
rescue Origami::Parser::ParsingError
  Origami::PDF.read(pdf_content_stream, lazy: false)
end
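
For what it's worth, the same fallback could be wrapped in a small helper. This is only a sketch: read_with_fallback, attempts and error_class are hypothetical names, not part of Origami. It also assumes that, when the source is an IO-like object, a failed first attempt may already have consumed it, so it rewinds before each try.

```ruby
# Hypothetical helper (not part of Origami): tries each option set in order
# and returns the first successful result. The actual read is done by the
# block, so the helper itself does not depend on any particular library.
def read_with_fallback(source, attempts:, error_class: StandardError)
  raise ArgumentError, "attempts must not be empty" if attempts.empty?

  last_error = nil
  attempts.each do |opts|
    # Assumption: an earlier attempt may have consumed an IO-like source.
    source.rewind if source.respond_to?(:rewind)
    begin
      return yield(source, opts)
    rescue error_class => e
      last_error = e # remember the failure and move on to the next option set
    end
  end

  # Every attempt failed: re-raise the most recent error.
  raise last_error
end

# Intended use with Origami (commented out so the sketch runs without the gem):
#
#   pdf = read_with_fallback(pdf_content_stream,
#                            attempts: [{ lazy: true }, { lazy: false }],
#                            error_class: Origami::Parser::ParsingError) do |io, opts|
#     Origami::PDF.read(io, opts)
#   end
```

This keeps the lazy attempt first and only pays for a full parse when the lazy one raises, which is the same behaviour as the begin/rescue above.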