ArgumentError: wrong number of arguments (given 9, expected 7)

jurriaanschrofer commented 2 years ago

Hi Thomas,

First of all, thank you for this wonderful gem – I have been using it for a long time with pleasure!

Last night I ran into my first hexapdf error in production (hexapdf version 0.22.0):

ArgumentError: wrong number of arguments (given 9, expected 7)

The line where the crash occurred is lib/hexapdf/content/operator.rb, line 195.

Naturally, I tried bumping hexapdf to the latest version, but alas to no avail. The error also seems to be specific to this one PDF: we optimize many PDFs in the same production context, and this is the only one that fails. I have attached the file below.

Just in case you may find it useful, I have also attached the stacktrace:

/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/content/operator.rb line 195 in serialize
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/task/optimize.rb line 238 in process
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/content/parser.rb line 192 in block in parse
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/content/parser.rb line 186 in loop
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/content/parser.rb line 186 in parse
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/content/parser.rb line 166 in parse
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/task/optimize.rb line 220 in block in compress_pages
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/type/page_tree_node.rb line 226 in block in each_page
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/pdf_array.rb line 183 in block in each
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/pdf_array.rb line 183 in each_index
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/pdf_array.rb line 183 in each
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/type/page_tree_node.rb line 224 in each_page
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/document/pages.rb line 131 in each
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/task/optimize.rb line 218 in compress_pages
/ruby/3.0.0/gems/hexapdf-0.16.0/lib/hexapdf/task/optimize.rb line 87 in call

hexapdf_crash_pdf.pdf

gettalong commented 2 years ago

The reason for the failure is invalid data inside the content stream of the page:

1.0
0
0
1.0
0.000000000000-2842171
0.00000000000-11368684
cm

The cm operator needs 6 numbers as arguments. The first four are okay, being 1, 0, 0, and 1. However, the last two are actually invalid numbers and are interpreted as 0, -2842171, 0, and -11368684; so four numbers instead of two.
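To make the split concrete, here is a minimal sketch of a number tokenizer, assuming the common rule that a sign can only start a number. It is a simplified illustration, not HexaPDF's actual tokenizer; `NUMBER` and `tokenize_operands` are hypothetical names.

```ruby
require 'strscan'

# Simplified number pattern (an assumption, NOT HexaPDF's real tokenizer):
# a sign may only appear at the start, so a '-' inside a run of digits
# ends the current token and starts a new one.
NUMBER = /[+-]?(?:\d+\.?\d*|\.\d+)/

# Splits a content stream fragment into numbers and operator names.
def tokenize_operands(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    scanner.skip(/\s+/)
    break if scanner.eos?
    if (num = scanner.scan(NUMBER))
      tokens << (num.include?('.') ? num.to_f : num.to_i)
    else
      # Anything that is not a number is treated as an operator name.
      tokens << scanner.scan(/\S+/).to_sym
    end
  end
  tokens
end

stream = "1.0\n0\n0\n1.0\n0.000000000000-2842171\n0.00000000000-11368684\ncm"
tokens = tokenize_operands(stream)
operands = tokens[0..-2]

puts operands.inspect
# => [1.0, 0, 0, 1.0, 0.0, -2842171, 0.0, -11368684]
puts operands.size
# => 8
```

Eight operands instead of six is consistent with the reported "given 9, expected 7", presumably because one additional argument is passed alongside the operands.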

The PDF spec says: "In PDF, all of the operands needed by an operator shall immediately precede that operator. Operators do not return results, and operands shall not be left over when an operator finishes execution." So even if we just used the last 6 numbers (which would clearly be wrong, as this current transformation matrix just specifies a translation), the operation would still be invalid.

There is more than one such invalid operand in the file.

I think that in this case we could just ignore that operation since a transformation matrix of the form 1 0 0 1 0 0 would be the identity matrix.
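As a small illustration of why 1 0 0 1 0 0 changes nothing: the six operands a b c d e f of cm map a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The sketch below applies that formula; `apply_matrix` is a hypothetical helper, not part of HexaPDF.

```ruby
# Apply a PDF transformation matrix [a, b, c, d, e, f] to the point (x, y):
#   x' = a*x + c*y + e
#   y' = b*x + d*y + f
def apply_matrix(matrix, x, y)
  a, b, c, d, e, f = matrix
  [a * x + c * y + e, b * x + d * y + f]
end

identity  = [1, 0, 0, 1, 0, 0]    # leaves every point where it is
translate = [1, 0, 0, 1, 10, 20]  # pure translation by (10, 20)

p apply_matrix(identity, 3, 4)   # => [3, 4]
p apply_matrix(translate, 3, 4)  # => [13, 24]
```

With a = d = 1 and b = c = 0, only the last two operands have any effect, so a translation by (0, 0) is a no-op.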

However, I don't know whether we can simply ignore an invalid operation in the general case. I will have to investigate a bit more.

jurriaanschrofer commented 2 years ago

Hi Thomas,

First of all, thanks for your quick reply.

To be quite frank, I don't know much about PDF myself. Putting PDF aside, I understand that simply ignoring similar operations in general may have unintended consequences.

But in general, I am curious how this invalid PDF format could have been created or how a valid PDF could have become invalid in this way. Do you have any educated guess?

gettalong commented 2 years ago

> To be quite frank, I don't know much about PDF myself. Putting PDF aside, I understand that simply ignoring similar operations in general may have unintended consequences.

I guess that viewers like Okular simply ignore the invalid instruction; see this screenshot taken when opening the file:

In this case there probably aren't any consequences because my guess is that the operation itself is redundant.

> But in general, I am curious how this invalid PDF format could have been created or how a valid PDF could have become invalid in this way. Do you have any educated guess?

That's rather easy :wink: There are myriad ways of inadvertently generating an invalid PDF document. For example, if one isn't careful when serializing numbers, one might accidentally serialize, ahem, not-a-number as 'NaN' or infinity as 'Inf' (https://github.com/gettalong/hexapdf/commit/2e3ef0cd5fddfca950b22a3f54a1dd6d746e1a1d :fearful:, also d68fd3525d040f745fa7e6229ec272a852d9652c).
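A defensive number serializer might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the commits mentioned above: `serialize_number` is a hypothetical name, and the six-digit precision is an arbitrary choice.

```ruby
# Hypothetical sketch of a defensive PDF number serializer.
# PDF numbers allow neither 'NaN', 'Inf', nor exponent notation,
# so reject non-finite values and always use fixed-point output.
def serialize_number(value)
  return value.to_s if value.is_a?(Integer)

  float = Float(value)
  unless float.finite?
    raise ArgumentError, "Can't serialize #{float} as a PDF number"
  end

  # Fixed-point with 6 decimals (assumed precision), then strip
  # trailing zeros and a dangling decimal point.
  format('%.6f', float).sub(/\.?0+\z/, '')
end

puts serialize_number(1.5)           # => 1.5
puts serialize_number(100.0)         # => 100
puts serialize_number(2.842171e-13)  # => 0
```

Rounding tiny values to 0 instead of writing an exponent keeps the output parseable by strict readers, at the cost of some precision.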

So this can surely happen. Sometimes the libraries get fixed but other libraries are abusing the quite relaxed reading behaviour of PDF viewers to get away with invalid PDFs. This is problematic and the reason why there have been many amendments to HexaPDF to allow reading and processing invalid PDFs so that HexaPDF is now among the best in this regard.

I will include a fix for this in the upcoming release.

gettalong commented 2 years ago

@jurriaanschrofer The fix will take the configuration option 'parser.on_correctable_error' into account, i.e. call it when an error is detected. So by default it will just ignore an invalid operation.
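The behaviour described here can be sketched in plain Ruby as follows. This is a simplified stand-in, not HexaPDF's implementation: `OPERAND_COUNTS`, `IGNORE_ERRORS`, and `filter_operations` are hypothetical names, and the callback is assumed to return true when an error should be raised, which is how the 'parser.on_correctable_error' hook appears to work.

```ruby
# Expected operand counts for a few content stream operators.
OPERAND_COUNTS = { cm: 6, m: 2, l: 2, re: 4, w: 1 }.freeze

# Hypothetical default callback: never raise, i.e. ignore invalid operations.
IGNORE_ERRORS = ->(_message) { false }

# Keeps only operations with a valid operand count. For each invalid one,
# the callback decides whether to raise (true) or silently drop it (false).
def filter_operations(operations, on_correctable_error: IGNORE_ERRORS)
  operations.select do |name, operands|
    expected = OPERAND_COUNTS[name]
    next true if expected.nil? || expected == operands.size

    message = "Invalid number of operands for #{name}: " \
              "given #{operands.size}, expected #{expected}"
    raise ArgumentError, message if on_correctable_error.call(message)
    false # drop the invalid operation
  end
end

operations = [
  [:w, [2]],
  [:cm, [1.0, 0, 0, 1.0, 0.0, -2842171, 0.0, -11368684]],
  [:re, [0, 0, 100, 50]],
]

kept = filter_operations(operations)
p kept.map(&:first)  # => [:w, :re]
```

Passing a callback that returns true, for example `->(_message) { true }`, turns the same situation into an ArgumentError, which suits callers who would rather reject invalid files than repair them.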

By the way, after correcting the document using the fix and running hexapdf opt --compress-pages invalid.pdf output.pdf, the visual appearance in Okular is the same.

jurriaanschrofer commented 2 years ago

Hi Thomas,

Once again thanks for your quick and in-depth reply!

PDF viewers being so tolerant when reading is both a blessing and a self-reinforcing curse, I guess ;-)

Can't wait for your fix, many thanks!

Jurriaan

gettalong commented 2 years ago

This is implemented and will be in the next release.

jurriaanschrofer commented 2 years ago

👍🏻

gettalong / hexapdf

ArgumentError: wrong number of arguments (given 9, expected 7) #183