Closed jurriaanschrofer closed 2 years ago
The reason for the failure is invalid data inside the content stream of the page:
1.0
0
0
1.0
0.000000000000-2842171
0.00000000000-11368684
cm
The cm operator needs 6 numbers as arguments. The first four are okay, being 1, 0, 0, and 1. However, the last two are actually invalid numbers and are interpreted as 0, -2842171, 0, and -11368684; so four numbers instead of two.
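To illustrate how a malformed number can split into two tokens, here is a minimal Ruby sketch. It is not HexaPDF's tokenizer; the `NUMBER` pattern and the `scan_numbers` helper are assumptions made for this example only.

```ruby
# Hypothetical tokenizer sketch, not HexaPDF's real implementation.
# A number token is an optional sign followed by digits with an optional
# fractional part, so a sign in the middle of a run starts a new token.
NUMBER = /[+-]?(?:\d+(?:\.\d*)?|\.\d+)/

def scan_numbers(source)
  source.scan(NUMBER).map do |token|
    # Tokens without a decimal point become integers, the rest floats.
    token.match?(/\A[+-]?\d+\z/) ? token.to_i : token.to_f
  end
end

operands = scan_numbers("1.0 0 0 1.0 0.000000000000-2842171 0.00000000000-11368684")
p operands       # the two malformed numbers each yield two values
p operands.size  # eight operands instead of the six that cm expects
```

Eight operands instead of six would be consistent with the reported "given 9, expected 7" error if one additional argument is passed alongside the operands, though that part is a guess.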
The PDF spec says "In PDF, all of the operands needed by an operator shall immediately precede that operator. Operators do not return results, and operands shall not be left over when an operator finishes execution.". So even if we just used the last 6 numbers (which would clearly be wrong as this content transformation matrix just specifies a translation), it would be invalid.
There is more than one such invalid operand.
I think that in this case we could just ignore that operation since a transformation matrix of the form 1 0 0 1 0 0 would be the identity matrix.
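As a small sketch of why such a matrix has no effect: the six cm operands a b c d e f map a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The `transform` helper and the translation values below are illustrative only and are not taken from the file.

```ruby
# Apply a PDF transformation matrix [a b c d e f] to a point (x, y).
def transform(matrix, x, y)
  a, b, c, d, e, f = matrix
  [a * x + c * y + e, b * x + d * y + f]
end

identity    = [1, 0, 0, 1, 0, 0]
translation = [1, 0, 0, 1, 10, 20] # illustrative values, not from the file

p transform(identity, 3, 4)    # the point is unchanged
p transform(translation, 3, 4) # the point is shifted by (10, 20)
```

With a = d = 1 and b = c = 0, only e and f matter, which is why the operation is a pure translation and collapses to the identity when both are zero.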
However, I don't know if we can just ignore - in the general case - an invalid operation. I will have to investigate a bit more.
Hi Thomas,
First of all, thanks for your quick reply.
To be quite frank, I don't know much about PDF myself. Putting PDF aside, I understand that simply ignoring similar operations in general may have unintended consequences.
But in general, I am curious how this invalid PDF format could have been created or how a valid PDF could have become invalid in this way. Do you have any educated guess?
> To be quite frank, I don't know much about PDF myself. Putting PDF aside, I understand that simply ignoring similar operations in general may have unintended consequences.
I guess viewers like Okular simply ignore the invalid instruction; see this screenshot taken when opening the file:
In this case there probably aren't any consequences because my guess is that the operation itself is redundant.
> But in general, I am curious how this invalid PDF format could have been created or how a valid PDF could have become invalid in this way. Do you have any educated guess?
That's rather easy :wink: There are myriad ways of inadvertently generating an invalid PDF document. For example, if one isn't careful when serializing numbers, one might accidentally serialize, ahem, not-a-number as 'NaN' or infinity as 'Inf' (https://github.com/gettalong/hexapdf/commit/2e3ef0cd5fddfca950b22a3f54a1dd6d746e1a1d :fearful:, also d68fd3525d040f745fa7e6229ec272a852d9652c).
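A defensive number serializer could look roughly like the following. This is only a sketch under assumptions: `serialize_pdf_number` is a hypothetical helper, not the code from the linked commits, and the six-digit precision is an arbitrary choice.

```ruby
# Hypothetical helper, not HexaPDF's serializer.
def serialize_pdf_number(value)
  return value.to_s if value.is_a?(Integer)
  # NaN and +/-Infinity have no representation in PDF syntax.
  raise ArgumentError, "cannot serialize #{value} as a PDF number" unless value.finite?

  # PDF numbers have no exponent notation, so use fixed-point output
  # and trim trailing zeros (and a dangling decimal point).
  text = format('%.6f', value).sub(/\.?0+\z/, '')
  text == '-0' ? '0' : text
end

puts serialize_pdf_number(1.0)
puts serialize_pdf_number(0.5)
puts serialize_pdf_number(-2.842171e-13) # tiny values collapse to 0
```

Going through a fixed-point format also avoids output such as `1.0e-07`, which Ruby's `Float#to_s` produces for small values and which is not a valid PDF number.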
So this can surely happen. Sometimes the libraries get fixed but other libraries are abusing the quite relaxed reading behaviour of PDF viewers to get away with invalid PDFs. This is problematic and the reason why there have been many amendments to HexaPDF to allow reading and processing invalid PDFs so that HexaPDF is now among the best in this regard.
I will include a fix for this in the upcoming release.
@jurriaanschrofer The fix will take the configuration option 'parser.on_correctable_error' into account, i.e. call it when an error is detected. So by default it will just ignore an invalid operation.
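The general idea of "call a hook, then ignore the invalid operation by default" can be sketched in plain Ruby as follows. Everything here is hypothetical: `EXPECTED_OPERANDS`, `process`, and the convention that the callback returns true to raise are assumptions of this sketch and do not show HexaPDF's actual implementation.

```ruby
# Hypothetical sketch; names and the callback convention are assumptions,
# not HexaPDF's actual API.
EXPECTED_OPERANDS = { cm: 6, q: 0, Q: 0 }.freeze

# Returns the operations whose operand count matches the operator.
# For a mismatch, on_error is called; a truthy result raises an error,
# anything else drops the operation and processing continues.
def process(operations, on_error: ->(_message) { false })
  operations.select do |operator, operands|
    expected = EXPECTED_OPERANDS[operator]
    next true if expected.nil? || expected == operands.size

    message = "#{operator} expects #{expected} operands, got #{operands.size}"
    raise ArgumentError, message if on_error.call(message)
    false
  end
end

operations = [
  [:q, []],
  [:cm, [1.0, 0, 0, 1.0, 0.0, -2842171, 0.0, -11368684]],
  [:Q, []],
]
kept = process(operations)
p kept.map(&:first) # the invalid cm operation is dropped
```

Passing `on_error: ->(_message) { true }` would turn the same mismatch into an exception, which is the strict behaviour one might want when validating documents rather than repairing them.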
By the way: After correcting the document using the fix and hexapdf opt --compress-pages invalid.pdf output.pdf the visual appearance in Okular is the same.
Hi Thomas,
Once again thanks for your quick and in-depth reply!
PDF viewers being tolerant when reading is both a blessing and a self-reinforcing curse, I guess ;-)
Can't wait till your fix, many thanks!
Jurriaan
This is implemented and will be in the next release.
Hi Thomas,
First of all, thank you for this wonderful gem. I have been using it for a long time with pleasure!
Last night I ran into my first hexapdf gem error in production (hexapdf version 0.22.0): ArgumentError: wrong number of arguments (given 9, expected 7)
The line where the crash occurred is lib/hexapdf/content/operator.rb line:195
Naturally, I tried bumping hexapdf to the latest version, but alas, to no avail. It also seems to be specific to this PDF, since we optimize many PDFs in the same production context and this is the only one that fails. I have attached the file below.
Just in case you may find it useful, I have also attached the stacktrace:
hexapdf_crash_pdf.pdf