boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

Parsing specific PDF in 1.0.21 - RangeError: index out of range (works in 1.0.20) #205

Open Laykou opened 2 years ago

Laykou commented 2 years ago

When trying to parse this PDF, _rose_production_splitpages.pdf (file was removed), we get the following error:

RangeError:
  index out of range
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `pos='
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `_parse_'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:79:in `parse'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/pdf_public.rb:98:in `initialize'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `new'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `parse'

How we call it:

CombinePDF.parse(blob.download, allow_optional_content: true).pages

This happens on versions 1.0.21 and 1.0.22, but not on 1.0.20.

We want to move to Ruby 3.1 and need the matrix fix that is in 1.0.22, but we cannot upgrade because this PDF fails to parse.
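
Until the parser itself is changed, one caller-side option is to rescue the RangeError so that a single malformed file does not abort the whole job. The following is only a sketch: `pages_or_nil` and its `parser` argument are hypothetical names introduced for illustration, not part of combine_pdf.

```ruby
# Hypothetical caller-side guard (not part of combine_pdf).
# `parser` is any callable that returns an object responding to #pages,
# for example a lambda wrapping CombinePDF.parse.
def pages_or_nil(data, parser)
  parser.call(data).pages
rescue RangeError => e
  # Report the failure and let the caller decide how to handle a nil result.
  warn "Skipping PDF that could not be parsed: #{e.message}"
  nil
end

# Usage with the gem (requires combine_pdf to be installed):
#   parser = ->(d) { CombinePDF.parse(d, allow_optional_content: true) }
#   pages  = pages_or_nil(blob.download, parser)
```

This only hides the failure from the caller; the affected PDF is still not parsed.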

Laykou commented 2 years ago

@boazsegev For some reason, this fix broke it: https://github.com/boazsegev/combine_pdf/commit/b966e703fd897ff50832d3823e74791099b82ca3

boazsegev commented 2 years ago

Hi @Laykou

Thank you for opening this issue.

Please see my comments on issue #185 and on issue #191.

I usually prefer lax parsers that allow formatting errors to be ignored when possible. However, issue #185 showed that a specific type of error cannot be safely ignored, which required that the parser become more strict.

I strongly suspect, from the description of the issue, that the specific PDF file is malformed.

Testing the PDF at https://www.datalogics.com/products/pdf-tools/pdf-checker/ fails: the testing suite doesn't even recognize the file as a PDF, let alone list the errors.

I have been authoring and maintaining this gem by myself for over 7 years and have been looking for a new maintainer for over 2 years. The community is enjoying my work, but not really contributing, so... 🤷🏼‍♂️ ... please forgive me for not investing more time and effort to solve this issue.

Kindly, Bo.

DimaSamodurov commented 2 years ago

Hi @boazsegev, it appears that the Length property of a stream can be incorrect in more cases than just the presence of the 'endstream' keyword within the content. Either way, preferring one method over the other for advancing the scanner position leads to issues. Many of these issues are acceptable to end users, provided the result looks right. For example, swallowing the "index out of range" error would fix parsing of the attached file, so it could then be combined and the work could get done.

Could we swallow the "index out of range" error and display a warning in this case? Would such a PR make sense?
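
To illustrate the idea, here is a minimal sketch of a "warn and fall back" approach using Ruby's StringScanner, whose `pos=` raises RangeError ("index out of range") when the target is past the end of the buffer. The helper name `read_stream_body` and its arguments are assumptions made for this example; this is not combine_pdf's actual parser code.

```ruby
require 'strscan'

# Hypothetical helper, NOT combine_pdf's actual parser code.
# Reads a stream body that starts at the scanner's current position.
# It trusts the declared Length first; if that points past the end of the
# buffer, it warns and falls back to searching for the 'endstream' keyword.
def read_stream_body(scanner, declared_length)
  start = scanner.pos
  begin
    # Raises RangeError when start + declared_length exceeds the buffer size.
    scanner.pos = start + declared_length
    scanner.string.byteslice(start, declared_length)
  rescue RangeError
    warn "Stream Length (#{declared_length}) is out of range; " \
         "scanning for 'endstream' instead"
    scanner.pos = start
    return nil unless scanner.skip_until(/endstream/)

    # Leave the scanner positioned just before the 'endstream' keyword.
    body_end = scanner.pos - 'endstream'.length
    scanner.pos = body_end
    scanner.string.byteslice(start, body_end - start)
  end
end
```

The fallback stops at the first 'endstream' it finds, so it can still misread a stream whose data happens to contain that keyword, and the recovered body may include a trailing end-of-line byte.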

Laykou commented 3 months ago

Do you think this could be fixed in a newer version?

julitrows commented 3 months ago

Getting index out of range (RangeError) on a user uploaded PDF in version 1.0.26 as well.

mtwzim commented 3 months ago

Some pull requests have been opened that could possibly solve this problem, but so far they have not been merged, and the problem still occurs almost a year after the PRs were submitted.

https://github.com/boazsegev/combine_pdf/pull/209
https://github.com/boazsegev/combine_pdf/pull/215

Can you take a look at them?