
PreTeXtBook / pretext-cli

Command line interface for quickly creating, authoring, and building PreTeXt documents.

https://pretextbook.org

GNU General Public License v3.0


pyPDF2 dependency? #603

Open oscarlevin opened 10 months ago

oscarlevin commented 10 months ago

In pyproject.toml we list pyPDF2 as a dependency. Do we actually use it? I wasn't able to find any import of it. I mostly ask because upstream I've added a dependency on pyMuPDF, which apparently has a superset of pyPDF2's capabilities and is much faster (at least according to the benchmarks advertised by pyMuPDF).

StevenClontz commented 10 months ago

I don't recall offhand, so I'll try removing it and seeing if the tests still pass in my codespace and on Actions

StevenClontz commented 10 months ago

Oh, my guess is that it's required by core (maybe on the command line?). Going to dig in there while I run the test suite without it.

StevenClontz commented 10 months ago

I think the issue is that it was a dependency of pdfcropmargins, but they had a misconfiguration of the version of pypdf2 they needed, so we hotfixed it on our end. Note pdfcropmargins is required by core: https://github.com/PreTeXtBook/pretext/blob/1eb62e945ccd1cd4b6f3268c7f5f087c40169cc8/pretext/pretext.py#L440

oscarlevin commented 10 months ago

Oh right! Okay, unless pyMuPDF can do the same thing as pdfcropmargins (I know it can crop, but I don't think automatically), then we are good.

StevenClontz commented 10 months ago

https://github.com/pymupdf/PyMuPDF/issues/617 <- maybe it can?

StevenClontz commented 10 months ago

For posterity, here's where we added the pypdf2 dependency: https://github.com/PreTeXtBook/pretext-cli/issues/289

StevenClontz commented 10 months ago

Reopening, as I think we should get upstream to rely on only one Python PDF package, and update our dependencies here to match.

oscarlevin commented 10 months ago

Definitely worth looking into more. I don't think the link to the pyMuPDF issue does it though, as we need to detect the content; we won't know the size to crop to ahead of time. But I might be misreading the answer given.

StevenClontz commented 10 months ago

That's possible. Then I wonder how much trouble it would be to compute the rectangle manually (might be a good upstream contribution to PyMuPDF if it's not too kludgy).

StevenClontz commented 10 months ago

We're looking at dropping pdf-crop-margins as we already need PyMuPDF for other functionality. I think I understand that page.setCropBox(r) crops the page to the rectangle r. Is there any way to automatically compute r to be the smallest rectangle containing all the content on a page (e.g. so we automatically detect and crop out margins)?

Yes, page.set_cropbox() (with page being a Page object) sets the visible part of a page.

It does not physically delete the part that becomes invisible. Setting a different rectangle later can make that content visible again.

To compute the smallest rectangle for anything the page has to show use page.get_bboxlog() in the following code snippet:
```python
import fitz  # PyMuPDF

# `page` is assumed to be an already loaded fitz.Page object
rect = fitz.EMPTY_RECT()  # start with the standard empty rectangle
for item in page.get_bboxlog():
    rect |= item[1]  # join this bbox into the result
# rect now wraps all page content
```
The advantage is that no text, image, or other content needs to be extracted to do this.

An item of page.get_bboxlog() looks like this: (type, (x0, y0, x1, y1)). "type" can be "fill-text", "fill-image" and more, showing the object type. The second tuple is the bounding box.
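
For anyone who wants to check the union logic without a PDF at hand, here is a minimal pure-Python sketch of the same idea. It assumes only the item shape described above, (type, (x0, y0, x1, y1)). The helper name `content_bbox` and the sample entries are made up for illustration and are not part of PyMuPDF or pretext.

```python
from typing import Iterable, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def content_bbox(bboxlog: Iterable[Sequence]) -> Optional[BBox]:
    """Return the smallest rectangle containing every logged bbox.

    Hypothetical helper: each item is expected to look like
    (type, (x0, y0, x1, y1)). Returns None when there are no items.
    """
    result: Optional[BBox] = None
    for item in bboxlog:
        x0, y0, x1, y1 = item[1]  # item[0] is the object type, unused here
        if result is None:
            result = (x0, y0, x1, y1)
        else:
            # Grow the running rectangle so it also covers this bbox.
            result = (
                min(result[0], x0),
                min(result[1], y0),
                max(result[2], x1),
                max(result[3], y1),
            )
    return result


# Hand-written sample entries in the documented shape (not real PDF output).
sample = [
    ("fill-text", (72.0, 80.0, 300.0, 95.0)),
    ("fill-image", (100.0, 120.0, 540.0, 400.0)),
]
print(content_bbox(sample))
```

With PyMuPDF, the result could presumably be passed on as page.set_cropbox(fitz.Rect(*bbox)). That step is untested here, and the rectangle may first need to be clipped to the page's own rectangle if some content extends past the page edge.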

oscarlevin commented 10 months ago

It looks like pdfCropMargins 2.x now uses pyMuPDF anyway. But I think we tried that and it caused issues. Might require an update from core, which would be good anyway, since anyone not using the CLI would just pip-install it and get the most recent version.

Once that is confirmed, we should at least upgrade to pdfCropMargins 2.x so pyPDF2 isn't needed at all.