Open oscarlevin opened 10 months ago
I don't recall offhand, so I'll try removing it and see whether the tests still pass in my codespace and on Actions.
Oh, my guess is that it's required by core (maybe on the command line?). I'm going to dig in there while I run the test suite without it.
I think the issue is that it was a dependency of pdfcropmargins, but they had a misconfiguration of the version of pypdf2 they needed, so we hotfixed it on our end. Note pdfcropmargins is required by core: https://github.com/PreTeXtBook/pretext/blob/1eb62e945ccd1cd4b6f3268c7f5f087c40169cc8/pretext/pretext.py#L440
Oh right! Okay, unless pyMuPDF can do the same thing as pdfcropmargins (I know it can crop, but I don't think automatically), then we are good.
https://github.com/pymupdf/PyMuPDF/issues/617 <- maybe it can?
For posterity, here's where we added the pypdf2 dependency: https://github.com/PreTeXtBook/pretext-cli/issues/289
Reopening, as I think we should get upstream to rely on only one Python PDF package and update our dependencies here to match.
Definitely worth looking into more. I don't think the link to the pyMuPDF issue does it though, as we need to detect the content; we won't know the size to crop to ahead of time. But I might be misreading the answer given.
That's possible. Then I wonder how much trouble it would be to compute the rectangle manually (might be a good upstream contribution to PyMuPDF if it's not too kludgy).
We're looking at dropping pdf-crop-margins as we already need PyMuPDF for other functionality. I think I understand that page.setCropBox(r) crops the page to the rectangle r. Is there any way to automatically compute r to be the smallest rectangle containing all the content on a page (e.g. so we automatically detect and crop out margins)?
Yes, page.set_cropbox() (with page being a Page object) sets the visible part of a page. It does not physically delete the part becoming invisible. Other values for that rectangle may recover these things.
To compute the smallest rectangle for anything the page has to show, use page.get_bboxlog() in the following code snippet:

    rect = fitz.EMPTY_RECT()  # start with the standard empty rectangle
    for item in page.get_bboxlog():
        rect |= item[1]  # join this bbox into the result
    # rect now wraps all page content

The advantage is that no text, image, or anything else needs to be extracted to do this.
An item of page.get_bboxlog() looks like this: (type, (x0, y0, x1, y1)). "type" can be "fill-text", "fill-image" and more, showing the object type. The second element is the bounding box.
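The union step described in the quoted answer can be sketched in plain Python, without PyMuPDF installed, by working on tuples shaped like the get_bboxlog() items shown above. The function name content_bbox and the sample data are hypothetical, made up for illustration; only the item shape (type, (x0, y0, x1, y1)) comes from the answer.

```python
from typing import Iterable, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def content_bbox(bboxlog: Iterable[Sequence]) -> Optional[BBox]:
    """Return the smallest rectangle containing every bbox in the log.

    Each item is assumed to look like (type, (x0, y0, x1, y1)).
    Returns None when the log is empty (for example, a blank page).
    """
    result: Optional[BBox] = None
    for item in bboxlog:
        x0, y0, x1, y1 = item[1]  # item[0] is the object type; unused here
        if result is None:
            result = (x0, y0, x1, y1)
        else:
            # Grow the running rectangle so it also covers this bbox.
            result = (
                min(result[0], x0),
                min(result[1], y0),
                max(result[2], x1),
                max(result[3], y1),
            )
    return result


# Hypothetical sample log with one text block and one image.
sample_log = [
    ("fill-text", (72.0, 100.0, 300.0, 120.0)),
    ("fill-image", (50.0, 200.0, 400.0, 500.0)),
]
print(content_bbox(sample_log))
```

With PyMuPDF itself, the resulting rectangle would presumably be handed to page.set_cropbox(), as in the answer. It may also need to be intersected with the page's existing boundaries first if any content extends past them; that is an assumption worth checking against a real document before relying on it.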
It looks like pdfCropMargins 2.x now uses pyMuPDF anyway. But I think we tried that and it caused issues. Might require an update from core, which would be good anyway, since anyone not using the CLI would just pip-install it and get the most recent version.
Once that is confirmed, we should at least upgrade to pdfCropMargins 2.x so pyPDF isn't needed at all.
In project.toml we list pyPDF2 as a dependency. Do we actually use this? I wasn't able to find any import of it. I mostly ask because upstream I've added a dependency on pyMuPDF, which apparently has a superset of capabilities and is much faster than pyPDF2 (at least according to the benchmarks advertised by pyMuPDF).