Closed apyrgio closed 9 months ago
PyMuPDF directly uses the C API of Tesseract. More specifically, it seems to statically link with the Tesseract library. To confidently answer this, we need to review the build scripts. However, there are some good indications that this is the case:
Also, we have tested that on a Windows and macOS host, the following code works without installing Tesseract, only installing PyMuPDF via PyPI:
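import fitz  # PyMuPDF

# Render the first page and OCR it with the Tesseract that ships inside PyMuPDF.
doc = fitz.open("./tests/test_docs/sample-pdf.pdf")
page = doc.load_page(0)
pix = page.get_pixmap()
buf = pix.pdfocr_tobytes(tessdata="/path/tessdata_fast-4.1.0")
# Use a context manager so the output file is flushed and closed.
with open("./test.pdf", "wb") as f:
    f.write(buf)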
This means that we can do OCR on macOS / Windows hosts, which we previously thought would be highly difficult (#625).
Even though PyMuPDF and GhostScript are developed by the same company (Artifex), (Py)MuPDF does not use GhostScript. From https://en.wikipedia.org/wiki/MuPDF:
Fitz was originally intended as an R&D project to replace the aging Ghostscript graphics library, but has instead become the rendering engine powering MuPDF.
Grepping for ghostscript / postscript throughout the code does not yield any result that shows that GhostScript is involved. Actually, PostScript code seems to be handled within mupdf.
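The search described above can be reproduced with a small script. This is a minimal sketch, assuming a local checkout of the source tree; the function name `grep_tree` and the pattern constant are illustrative and not part of any project.

```python
import re
from pathlib import Path

# Terms from the text above; matching is case-insensitive.
PATTERN = re.compile(r"ghostscript|postscript", re.IGNORECASE)


def grep_tree(root, pattern=PATTERN):
    """Yield (path, line_number, line) for every matching line under root."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Ignore undecodable bytes so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file, skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield path, number, line.strip()
```

Calling `list(grep_tree("./PyMuPDF"))` (a hypothetical checkout path) returns the hits, which still need to be read by hand to tell real usage apart from comments or documentation.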
Removing our dependency on GhostScript is good news, since it has been the source of CVEs in the past.
The fact that PyMuPDF allows 2nd stage conversion on the host opens the way for lots of improvements in the container image. Basically, the only packages that we need to install are:
Unfortunately, PyMuPDF is not available on Alpine Linux. This means that we need to install it with `pip install`, and add some build dependencies as well. Here are some findings for reducing the image size:
We should delete our build dependencies in the same step that installs them, so that they are not included in the image layer.
When using `pip install`, we should make it not use a filesystem cache. Otherwise, the cache can take up more than 100 MiB:
/ # du -hd 1 /root | sort -h
133.8M /root/.cache
133.9M /root
When building PyMuPDF from source, a `fitz_new` module is also built. This is a "rebased" implementation of PyMuPDF that is probably not ready for production use yet. We can shave off 50 MiB by removing it:
/ # du -hd 1 /usr/lib/python3.11/site-packages | sort -h
[...]
28.3M /usr/lib/python3.11/site-packages/fitz
49.7M /usr/lib/python3.11/site-packages/fitz_new
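The last two findings can be sketched as a pair of helpers. This is a sketch under stated assumptions, not how the image is actually built: `pip_install_command` and `remove_module_dir` are hypothetical names, and the only external behavior relied on is pip's `--no-cache-dir` flag.

```python
import shutil
import sys
from pathlib import Path


def pip_install_command(*packages):
    """Build a pip invocation that does not populate the on-disk cache."""
    return [sys.executable, "-m", "pip", "install", "--no-cache-dir", *packages]


def remove_module_dir(site_packages, name="fitz_new"):
    """Delete <site_packages>/<name> and return the number of bytes freed."""
    target = Path(site_packages) / name
    if not target.is_dir():
        return 0  # nothing to remove
    # Sum file sizes before deleting, so the saving can be reported.
    freed = sum(p.stat().st_size for p in target.rglob("*") if p.is_file())
    shutil.rmtree(target)
    return freed
```

The command list can be passed to `subprocess.run(..., check=True)`, and `remove_module_dir("/usr/lib/python3.11/site-packages")` would target the directory shown in the listing above. Whether anything imports `fitz_new` at runtime should be checked before deleting it.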
The fact that PyMuPDF is difficult to build in Alpine Linux raises the question: can we use a different OS? It turns out that PyMuPDF is available in the official Debian repos. This is good, because we can take advantage of two Debian properties that Alpine Linux does not have:
- `--no-install-recommends` / `--no-install-suggests`. Alpine Linux does not have this flag, but instead allows you to arbitrarily delete packages. This may be very brittle though.
- The `libreoffice-core-nogui` flavor of LibreOffice. This flavor has the minimum requirements for scripting LibreOffice, and does not bring any extra libraries, such as Wayland and Mesa.

On the flip side, Alpine Linux is a rolling release distro, which always gets the latest version of the upstream packages. So, we use it not just for its small footprint, but for its security properties as well. Debian takes security very seriously as well, in two different ways:
Testing / Unstable flavors (Trixie / Sid) are closer to the upstream versions, but are not guaranteed to get security fixes, because they rely on upstream to include them:
Sid exclusively gets security updates through its package maintainers. The Debian Security Team only maintains security updates for the current "stable" release.
So, it seems that if we were to switch from Alpine Linux to Debian, the Testing/Unstable flavors would offer similar security guarantees.
The following tables compare these image types:

- The `main` branch.
- A `debian:unstable-slim` image that installs only the necessary packages with `--no-install-recommends` / `--no-install-suggests`.
- A `debian:bookworm-slim` image that installs only the necessary packages with `--no-install-recommends` / `--no-install-suggests`.

Image | Compressed (MiB) | Uncompressed (MiB) |
---|---|---|
Alpine (current) | 624 | 1372 |
Alpine (PyMuPDF) | 413 | 862 |
Debian (Unstable) | 256 | 570 |
Debian (Stable) | 253 | 564 |

Image | Packages |
---|---|
Alpine (current) | 286 |
Alpine (PyMuPDF) | 273 |
Debian (Unstable) | 222 |
Debian (Stable) | 221 |

Image | Critical | High | Medium | Low | Negligible |
---|---|---|---|---|---|
Alpine (current) | 0 | 14 | 37 | 6 | 0 |
Alpine (PyMuPDF) | 0 | 13 | 35 | 6 | 0 |
Debian (Unstable) | 0 | 3 | 8 | 6 | 129 |
Debian (Stable) | 1 | 17 | 25 | 10 | 132 |
Debian (Stable, excluding `won't fix`) | 0 | 4 | 6 | 0 | 131 |

(Debian Stable marks some CVEs as `won't fix`, meaning that a vulnerability does not apply to it.)
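To make the comparison easier to read, the figures can be turned into relative numbers. The sketch below only copies values from the tables above; the helper names are illustrative.

```python
# Uncompressed image sizes in MiB, copied from the first table.
UNCOMPRESSED_MIB = {
    "Alpine (current)": 1372,
    "Alpine (PyMuPDF)": 862,
    "Debian (Unstable)": 570,
    "Debian (Stable)": 564,
}

# Vulnerability counts as (critical, high, medium, low, negligible).
VULNERABILITIES = {
    "Alpine (current)": (0, 14, 37, 6, 0),
    "Alpine (PyMuPDF)": (0, 13, 35, 6, 0),
    "Debian (Unstable)": (0, 3, 8, 6, 129),
    "Debian (Stable)": (1, 17, 25, 10, 132),
    "Debian (Stable, excluding won't fix)": (0, 4, 6, 0, 131),
}


def reduction_percent(baseline, value):
    """Relative size reduction of `value` against `baseline`, in percent."""
    return 100 * (baseline - value) / baseline


def critical_and_high(counts):
    """Number of findings rated Critical or High."""
    critical, high, *_ = counts
    return critical + high


baseline = UNCOMPRESSED_MIB["Alpine (current)"]
for image, size in UNCOMPRESSED_MIB.items():
    print(f"{image}: {size} MiB ({reduction_percent(baseline, size):.1f}% smaller)")
for image, counts in VULNERABILITIES.items():
    print(f"{image}: {critical_and_high(counts)} critical/high")
```

By these figures, the Debian Stable image is roughly 59% smaller uncompressed than the current Alpine image, and it has 4 Critical/High findings once `won't fix` entries are excluded, against 14 for the current image.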
The following diagram shows how the integration of PyMuPDF opens the door for more improvements throughout the codebase, and how it solves some limitations.
(This file was created in https://draw.io, and can be edited there by uploading the above `.png`, since it has the diagram embedded in it. Sweet...)
Thanks for this investigation @apyrgio! The PyMuPDF + debian stable slim does seem really promising.
We stress-tested PyMuPDF in a large set of tests and found that, overall, it did not degrade performance for most documents. In many cases it actually improved it, but it's hard to tell since we don't have a real-world set of documents.
We summarized some of the results in this presentation
The possibility of using PyMuPDF was brought up as a solution to the congestion problem we encountered in https://github.com/freedomofpress/dangerzone/issues/616, and was immediately introduced in PR https://github.com/freedomofpress/dangerzone/pull/622.
While looking more into how PyMuPDF works though, we realized that it can help us tackle more problems than the original one. As of writing this issue, our current understanding is that we can use PyMuPDF to:
- Replace `pdfinfo` / `pdftoppm` in the 1st stage of the conversion (#622).
- Replace several tools (`gm` / `tesseract` / `pdfunite` / `ps2pdf`) in the 2nd stage of the conversion and perform the conversion on the Linux/macOS/Windows hosts (#625).

This issue holds all of our questions regarding the integration of PyMuPDF, either in terms of feasibility, security, or performance, as well as other effects it has in our code.