earthlingworks closed this 1 year ago
I can confirm:
$ hp ins gh227-hexapdf_error.pdf po 24
399 0 obj
<<
/Type /Page
/Parent 1 0 R
/Contents 398 0 R
>>
endobj
So page 24 is missing the /MediaBox key, and that key is also not found in the parent page tree node.
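For context, /MediaBox is an inheritable page attribute, so a reader has to walk up the /Parent chain before concluding that the key is really missing. A minimal sketch of that lookup, assuming plain Ruby hashes as stand-ins for PDF dictionaries (this is not HexaPDF's API, and `inherited_value` is a hypothetical helper):

```ruby
# Plain hashes stand in for PDF dictionaries; :Parent holds the parent hash
# directly instead of an indirect reference. Illustrative only.
def inherited_value(node, key)
  seen = {}.compare_by_identity
  while node && !seen.key?(node) # guard against cyclic /Parent chains
    return node[key] if node.key?(key)
    seen[node] = true
    node = node[:Parent]
  end
  nil
end

root = { Type: :Pages }                                  # like object 1 0: no /MediaBox
page = { Type: :Page, Parent: root, Contents: :stream_398 }

p inherited_value(page, :MediaBox) # nil: neither the page nor its parent defines it
```

The identity-based `seen` set is there because broken files can contain /Parent loops, and a lookup that never terminates would be worse than a missing value.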
Weird but yeah, seems to happen with some docs (most are ok). Best for us to catch the error and fall back to a different container for page dimensions?
So most viewers I tried this document with just use a default page size when the /MediaBox entry is missing. That default page size usually seems to be A4. Adobe Reader complains about a problem when viewing that page, and the page size itself is tiny, something like 10pt x 10pt or so.
So one remedy would be to just say: if the page doesn't have a valid page size, we assume a pre-defined one (e.g. A4) and hope for the best. In your example file not all the text is visible when this strategy is employed, but there is at least no error.
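A rough sketch of that fallback on the calling side, assuming the box arrives as a plain `[llx, lly, urx, ury]` array or `nil`. `usable_box` and `A4_BOX` are hypothetical names, not part of HexaPDF:

```ruby
A4_BOX = [0, 0, 595, 842].freeze # A4 in PDF points (210 mm x 297 mm), rounded

# Returns box when it is four numbers with non-zero width and height,
# otherwise the fallback.
def usable_box(box, fallback = A4_BOX)
  return fallback unless box.is_a?(Array) && box.size == 4 &&
                         box.all? { |v| v.is_a?(Numeric) }

  llx, lly, urx, ury = box
  (urx - llx).zero? || (ury - lly).zero? ? fallback : box
end

p usable_box(nil)              # falls back to A4_BOX
p usable_box([0, 0, 612, 792]) # kept as-is
```

As noted above, content may still be cut off when the real page was larger than the assumed default, so this only trades an error for a best guess.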
> Weird but yeah, seems to happen with some docs (most are ok). Best for us to catch the error and fall back to a different container for page dimensions?
What do you mean by container? Another page box? But that wouldn't work in this case since this page has no page boxes defined.
You could also check the pages around the afflicted page and determine the page size this way. I.e. if the prior and next pages have the same dimensions, in most cases the page in the middle should have the same dimensions.
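The neighbor heuristic could look roughly like this, assuming the page sizes have already been collected into an array with `nil` for pages whose size is unknown. `size_from_neighbors` is a hypothetical helper, and it compares sizes exactly, so real code may want a small tolerance:

```ruby
# sizes: one [width, height] pair per page, nil where the size is unknown.
# Returns the neighbors' size when both exist and agree, otherwise nil.
def size_from_neighbors(sizes, index)
  return nil if index <= 0 || index >= sizes.size - 1

  before = sizes[index - 1]
  after = sizes[index + 1]
  before && before == after ? before : nil
end

sizes = [[612, 792], nil, [612, 792], [842, 595]]
p size_from_neighbors(sizes, 1) # [612, 792]
p size_from_neighbors(sizes, 0) # nil: the first page has no prior page
```

Returning `nil` when the neighbors disagree leaves the decision to the caller, who can then fall back to a fixed default.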
Ok, got it. Yeah, I meant another box. If the other documents also don't have any page boxes defined, then it sounds like our best bet would be to do as you suggest and default to the other pages around it. Thanks!
If it helps, I could add some validation code that checks for a missing /MediaBox and sets it to some pre-defined value. Currently, validating page 24 yields two errors, one for the missing /MediaBox and one for the missing /Resources key.
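To make the idea concrete, here is a minimal sketch of such a check. It is not HexaPDF's actual validation code: the page is a plain hash with inherited values assumed to be already resolved, and `validate_page`, the messages, and the default values are all assumptions for illustration.

```ruby
DEFAULT_MEDIA_BOX = [0, 0, 595, 842].freeze # assumed default (A4, rounded points)

# Yields one message per missing required field. Returns true when the page
# was valid or every problem was corrected, false otherwise.
def validate_page(page, auto_correct: true)
  defaults = { MediaBox: DEFAULT_MEDIA_BOX.dup, Resources: {} }
  ok = true
  defaults.each do |key, value|
    next if page.key?(key)

    yield "Required page field #{key} is missing" if block_given?
    if auto_correct
      page[key] = value
    else
      ok = false
    end
  end
  ok
end

page = { Type: :Page, Contents: :stream_398 }
validate_page(page) { |msg| puts msg } # reports both missing fields
p page[:MediaBox]                      # now holds the default box
```

An empty dictionary is used for /Resources because that is the valid value for a page that needs no resources; whether the page's content actually renders with it is a separate question.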
Oh yeah, anything like that would be awesome!
This will be in the next release that's coming shortly.
So, just after I implemented the change I saw that page 24 (object ID 399,0) actually has a different error: its parent should be object 400,0 and not 1,0! Object 400,0 has a /MediaBox entry and is referenced from the root page tree node, but, similar to 399,0, its parent link is broken: its /Parent entry is missing.
So doing the following will actually correct the page tree and result in correct output, even without this new change:
require 'hexapdf'

document = HexaPDF::Document.open(ARGV[0])
document.validate # also corrects fixable problems, here the broken page tree
document.pages.each_with_index { |page, i| puts "Index: #{i}, width: #{page.box.width}" }
Generally, if you expect to handle invalid documents, running doc.validate before anything else will correct some issues (like in this case).
Ok great, good to know and thank you!! We'll give that a try.
We seem to be having an issue with certain PDFs where the page box information is missing. I'll send in the PDF separately.
You can reproduce two ways:
pdfinfo -box -l 46 PATH
or