cantoo-scribe / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

Strange MediaBox and CropBox of PDF page #59

Closed pedromdev closed 3 months ago

pedromdev commented 3 months ago

What were you trying to do?

I'm trying to get the page box information so that I can center an image on the page.

How did you attempt to do it?

I tried using the getMediaBox() and getCropBox() methods of the PDFPage object to calculate the page's center position.
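
For reference, a minimal sketch of the center calculation described above, assuming the box has the { x, y, width, height } shape that getMediaBox() and getCropBox() return. The boxCenter helper and the sample values are hypothetical; they are not part of pdf-lib and are not taken from the attached PDF.

```javascript
// Hypothetical helper: center of a box shaped like { x, y, width, height }.
function boxCenter(box) {
  return {
    x: box.x + box.width / 2,
    y: box.y + box.height / 2,
  };
}

// Hard-coded sample boxes for illustration only.
const upright = { x: 0, y: 0, width: 600, height: 800 };
const flipped = { x: 0, y: 800, width: 600, height: -800 };

console.log(boxCenter(upright)); // { x: 300, y: 400 }
console.log(boxCenter(flipped)); // { x: 300, y: 400 }
```

Since x + width / 2 is the midpoint between the two edges in either direction, the center comes out the same even when a dimension is reported as negative.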

What actually happened?

I got strange information about the page boxes. In some cases, the height and width of a page are negative values.

What did you expect to happen?

Get the correct information about the page boxes.

How can we reproduce the issue?

I added a comment in an old issue about the MediaBox and CropBox. I added the PDF that I tested and a piece of code.

I have a PDF whose first page has different box information than the other pages. However, the values that pdfinfo reports differ from the values that pdf-lib gives me.

[4 images]

I drew some circles using the CropBox information, and this is how the two tests turned out. The first printout uses the pdfinfo values. The second printout uses the values that pdf-lib returns from the getCropBox() method.

[3 images]

How is this MediaBox and CropBox information obtained in pdf-lib?

The example PDF is below:

input2.pdf

Note: I later understood that width and height are used to calculate the xEnd and yEnd of the PDF page, but even when I calculate the end point I don't get the same values.

Version

2.2.0

What environment are you running pdf-lib in?

Node

Additional Notes

I tested both versions and got the same results:

Sharcoux commented 3 months ago

The PDF seems malformed, but we can update pdf-lib to handle this malformation. How was this PDF generated?

Sharcoux commented 3 months ago

Solved in @cantoo/pdf-lib: 2.2.0

Sharcoux commented 3 months ago

I would still like to know how the PDF was generated, though.

pedromdev commented 3 months ago

Hi @Sharcoux.

The PDF I added here is a partial PDF that I created from another one just to reproduce the behavior. According to pdfinfo, the original PDF was created in Adobe InDesign CS6.

[image]

Sharcoux commented 3 months ago

According to the specs,

The MediaBox is defined as an array of four numbers, typically in the format [llx lly urx ury], where:

    llx: The lower-left x-coordinate.
    lly: The lower-left y-coordinate.
    urx: The upper-right x-coordinate.
    ury: The upper-right y-coordinate.

In the PDF you provided, the MediaBox has the two y coordinates inverted, which leads to the wrong result. I don't know which tool is the culprit in the file generation, but the file is definitely malformed.
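
To illustrate why inverted y coordinates produce a negative height, here is a sketch of a naive conversion that assumes the array is always [llx lly urx ury]. The function name and the sample numbers are hypothetical, and this is not presented as pdf-lib's actual code.

```javascript
// Naive reading: treats the array strictly as [llx, lly, urx, ury].
function rectangleFromArrayNaive([llx, lly, urx, ury]) {
  return { x: llx, y: lly, width: urx - llx, height: ury - lly };
}

// Conventional order: lower-left corner first.
console.log(rectangleFromArrayNaive([0, 0, 612, 792]));
// { x: 0, y: 0, width: 612, height: 792 }

// Same page with the two y coordinates swapped (illustrative values).
console.log(rectangleFromArrayNaive([0, 792, 612, 0]));
// { x: 0, y: 792, width: 612, height: -792 }
```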

RippleRurigaki commented 3 months ago

I have seen this issue and have looked into it.

Looking at the PDF spec, 7.7.3.3 Page Object:

The MediaBox type is "rectangle".

Looking at 7.9.5 Rectangles:

Rectangles are used to describe locations on a page and bounding boxes for a variety of objects. A rectangle shall be written as an array of four numbers giving the coordinates of a pair of diagonally opposite corners.

NOTE Although rectangles are conventionally specified by their lower-left and upper-right corners, it is acceptable to specify any two diagonally opposite corners.

As I understand it, the array does not have to be [lower left, upper right], although any other order is uncommon.
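
One way to accept any two diagonally opposite corners, as the note in 7.9.5 allows, is to take the minimum and maximum of each coordinate pair. This is only a sketch: normalizeRectangle is a hypothetical helper, and it may not match how pdf-lib handles the case.

```javascript
// Hypothetical helper: accepts any two diagonally opposite corners
// [x1, y1, x2, y2] and returns a box with non-negative width and height.
function normalizeRectangle([x1, y1, x2, y2]) {
  const x = Math.min(x1, x2);
  const y = Math.min(y1, y2);
  return {
    x,
    y,
    width: Math.max(x1, x2) - x,
    height: Math.max(y1, y2) - y,
  };
}

// Upper-left / lower-right corners (illustrative values).
console.log(normalizeRectangle([0, 792, 612, 0]));
// { x: 0, y: 0, width: 612, height: 792 }
```

With this approach all four possible corner orderings of the same page give the same box, so downstream placement code can keep assuming a lower-left origin.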

I thought it would be easy to adjust the values obtained, but I am concerned about whether doing so would affect the other placement coordinates.

I noticed this but have not been able to confirm it yet, and I do not have the time right now.

Sharcoux commented 3 months ago

Ok. Well, anyway, from version 2.2.1, both will be supported, so I think I'll just close this. Thanks for the clarification.