apyrgio commented 9 months ago

The possibility of using PyMuPDF was brought up as a solution to the congestion problem we encountered in https://github.com/freedomofpress/dangerzone/issues/616, and was immediately introduced in PR https://github.com/freedomofpress/dangerzone/pull/622.

While looking more into how PyMuPDF works though, we realized that it can help us tackle more problems than the original one. As of writing this issue, our current understanding is that we can use PyMuPDF to:

Replace pdfinfo / pdftoppm in the 1st stage of the conversion (#622).
Replace all external commands (gm / tesseract / pdfunite / ps2pdf) in the 2nd stage of the conversion and perform the conversion on the Linux/macOS/Windows hosts (#625).
Convert a PDF to pixels, and pixels to a (searchable) PDF, without touching the filesystem (#633, #443).
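
To make the third point more concrete, here is a minimal sketch of an in-memory round trip. It is not the code from the linked PRs: the function names `pdf_to_pixels` / `pixels_to_pdf`, the `dpi` default, and the tuple layout are assumptions made for illustration, and the `tessdata` path has to be supplied by the caller.

```python
def pdf_to_pixels(pdf_bytes, dpi=150):
    """Render each page of an in-memory PDF to raw RGB pixels."""
    import fitz  # PyMuPDF; imported lazily so the sketch loads without it

    pages = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # RGB without alpha by default
            pages.append((pix.width, pix.height, pix.samples))
    return pages


def pixels_to_pdf(pages, tessdata):
    """Rebuild a searchable PDF from raw RGB pixels, one page at a time."""
    import fitz

    out = fitz.open()  # new, empty PDF that only lives in memory
    for width, height, samples in pages:
        pix = fitz.Pixmap(fitz.csRGB, width, height, samples, False)
        # pdfocr_tobytes() runs OCR and returns a one-page PDF as bytes
        ocr_bytes = pix.pdfocr_tobytes(tessdata=tessdata)
        page_pdf = fitz.open(stream=ocr_bytes, filetype="pdf")
        out.insert_pdf(page_pdf)
        page_pdf.close()
    return out.tobytes()
```

Neither function opens a file: the input and output are `bytes`, and the only filesystem access left is Tesseract reading its language data from `tessdata`.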

This issue holds all of our questions regarding the integration of PyMuPDF, either in terms of feasibility, security, or performance, as well as other effects it has in our code.

apyrgio commented 9 months ago

How does PyMuPDF integrate with Tesseract?

PyMuPDF directly uses the C API of Tesseract. More specifically, it seems to statically link with the Tesseract library. To confidently answer this, we need to review the build scripts. However, there are some good indications that this is the case:

The PyMuPDF package on Debian does not list Tesseract or MuPDF as dependencies.
The PyMuPDF API allows the user to specify the Tesseract data directory, but not the path to the Tesseract binary.

Also, we have tested that on Windows and macOS hosts, the following code works without installing Tesseract, with only PyMuPDF installed from PyPI:

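```python
import fitz  # PyMuPDF

# Render the first page of a sample PDF to pixels
doc = fitz.open("./tests/test_docs/sample-pdf.pdf")
page = doc.load_page(0)
pix = page.get_pixmap()

# OCR the pixels and get back a searchable PDF as bytes
buf = pix.pdfocr_tobytes(tessdata="/path/tessdata_fast-4.1.0")
with open("./test.pdf", "wb") as f:
    f.write(buf)
```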

This means that we can do OCR on macOS / Windows hosts, which we previously thought would be highly difficult (#625).
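
When repeating the test above, it may be worth confirming first that no Tesseract binary is reachable on the host, so that a successful OCR run can only come from PyMuPDF itself. A small sketch using the standard library (the helper name `find_binary` is hypothetical):

```python
import shutil


def find_binary(name):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_binary("tesseract")
    if path is None:
        print("No tesseract binary on PATH")
    else:
        print(f"tesseract found at {path}; the test would not be conclusive")
```

This only checks `PATH`, so it cannot rule out a Tesseract shared library installed elsewhere on the system.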

apyrgio commented 9 months ago

Does PyMuPDF use GhostScript?

Even though PyMuPDF and GhostScript are developed by the same company (Artifex), (Py)MuPDF does not use GhostScript. From https://en.wikipedia.org/wiki/MuPDF:

Fitz was originally intended as an R&D project to replace the aging Ghostscript graphics library, but has instead become the rendering engine powering MuPDF.

Grepping for ghostscript / postscript throughout the code does not yield any result that suggests GhostScript is involved. In fact, PostScript code seems to be handled within MuPDF itself.
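
For anyone who wants to repeat that search without a shell, a rough Python equivalent of a recursive, case-insensitive grep could look like the sketch below. The function name `grep_tree` and the choice of pattern are assumptions, and the hits still need to be read by hand to tell references to the PostScript language apart from calls into GhostScript.

```python
import os
import re

PATTERN = re.compile(r"ghostscript|postscript", re.IGNORECASE)


def grep_tree(root, pattern=PATTERN):
    """Yield (path, line_number, line) for every matching line under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for number, line in enumerate(f, start=1):
                        if pattern.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file; skip it
```

Calling `list(grep_tree("path/to/source/checkout"))` on a local checkout returns every hit with its file and line number.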

Removing our dependency on GhostScript is good news, since it has been the source of CVEs in the past.

apyrgio commented 9 months ago

How does PyMuPDF affect our container image size?

The fact that PyMuPDF allows 2nd stage conversion on the host opens the way for lots of improvements in the container image. Basically, the only packages that we need to install are:

LibreOffice
PyMuPDF
python3-magic
fonts-noto-cjk
OpenJDK8

Unfortunately, PyMuPDF is not available on Alpine Linux. This means that we need to install it with pip install, and add some build dependencies as well. Here are some findings for reducing the image size:

We should delete our build dependencies in the same step that we install them, so that they are not included in the image layer.
When using pip install, we should disable its filesystem cache (e.g., with --no-cache-dir). Otherwise, the cache can take up more than 100 MiB:
```
/ # du -hd 1 /root | sort -h
133.8M  /root/.cache
133.9M  /root
```
When building PyMuPDF from source, a fitz_new module is also built. It is a "rebased" implementation of PyMuPDF that's probably not ready for production use yet. We can shave off about 50 MiB by removing it:
```
/ # du -hd 1 /usr/lib/python3.11/site-packages | sort -h
[...]
28.3M   /usr/lib/python3.11/site-packages/fitz
49.7M   /usr/lib/python3.11/site-packages/fitz_new
```
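
The `du` checks above can also be scripted, which may be handy if we want to track these sizes over time. The sketch below is a rough stand-in for `du -d 1 | sort`, not a replacement: it sums apparent file sizes rather than disk blocks, so its numbers can differ slightly from `du`. The function names are hypothetical.

```python
import os


def dir_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def largest_subdirs(root):
    """Return (name, size_in_bytes) for each direct subdirectory, smallest first."""
    sizes = []
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size(entry.path)))
    return sorted(sizes, key=lambda item: item[1])
```

For example, `largest_subdirs("/usr/lib/python3.11/site-packages")[-2:]` would list the two largest packages in the container.
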

What about other OSes?

The fact that PyMuPDF is difficult to build in Alpine Linux raises the question: can we use a different OS? It turns out that PyMuPDF is available in the official Debian repos. This is good, because we can take advantage of two Debian properties that Alpine Linux does not have:

Slim down our container image with --no-install-recommends / --no-install-suggests. Alpine Linux does not have this flag, but instead allows you to arbitrarily delete packages. This may be very brittle though.
Install the libreoffice-core-nogui flavor of LibreOffice. This flavor has the minimum requirements for scripting LibreOffice, and does not bring any extra libraries, such as Wayland and Mesa.

On the flip side, Alpine Linux is a rolling release distro, which always gets the latest version of the upstream packages. So, we use it not just for its small footprint, but for its security properties as well. Debian takes security very seriously as well, in two different ways:

Stable flavors (Bullseye / Bookworm) generally offer less recent versions of a software, but backport security fixes from upstream as soon as possible.
Testing / Unstable flavors (Trixie / Sid) are closer to the upstream versions, but are not guaranteed to get security fixes, because they rely on upstream to include them:

Sid exclusively gets security updates through its package maintainers. The Debian Security Team only maintains security updates for the current "stable" release.

So, it seems that if we were to switch from Alpine Linux to Debian, the Testing/Unstable flavors would offer similar security guarantees.

Comparisons

The following tables offer comparisons between the following image types:

Alpine (current): This is the Alpine image as built from the main branch.
Alpine (PyMuPDF): This is the Alpine image that has been tweaked to install only the necessary packages, plus PyMuPDF.
Debian (Unstable): This is the debian:unstable-slim image that installs only the necessary packages with --no-install-recommends / --no-install-suggests.
Debian (Stable): This is the debian:bookworm-slim image that installs only the necessary packages with --no-install-recommends / --no-install-suggests.

Image size impact

Image	Compressed (MiB)	Uncompressed (MiB)
Alpine (current)	624	1372
Alpine (PyMuPDF)	413	862
Debian (Unstable)	256	570
Debian (Stable)	253	564

Image	Packages
Alpine (current)	286
Alpine (PyMuPDF)	273
Debian (Unstable)	222
Debian (Stable)	221
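
To put the size table in relative terms, the short calculation below expresses each image as a percentage reduction against the current Alpine image. The numbers are copied from the table above; the helper name `reduction_percent` is only illustrative.

```python
# Sizes in MiB, copied from the table above: (compressed, uncompressed)
SIZES = {
    "Alpine (current)": (624, 1372),
    "Alpine (PyMuPDF)": (413, 862),
    "Debian (Unstable)": (256, 570),
    "Debian (Stable)": (253, 564),
}


def reduction_percent(baseline, candidate):
    """Percentage by which candidate is smaller than baseline."""
    return 100 * (baseline - candidate) / baseline


base_compressed, base_uncompressed = SIZES["Alpine (current)"]
for image, (compressed, uncompressed) in SIZES.items():
    print(
        f"{image}: "
        f"{reduction_percent(base_compressed, compressed):.1f}% smaller compressed, "
        f"{reduction_percent(base_uncompressed, uncompressed):.1f}% smaller uncompressed"
    )
```

By this measure, both Debian images are roughly 59% smaller than the current Alpine image, compressed or not, while the tweaked Alpine image is about 34% smaller compressed.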

CVEs impact

Image	Critical	High	Medium	Low	Negligible
Alpine (current)	0	14	37	6	0
Alpine (PyMuPDF)	0	13	35	6	0
Debian (Unstable)	0	3	8	6	129
Debian (Stable)	1	17	25	10	132
Debian (Stable, excluding `won't fix`)	0	4	6	0	131

(Debian Stable marks some CVEs as won't fix, meaning that a vulnerability does not apply to it)
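
Since the Negligible column dominates the Debian rows, it may be easier to compare the images with that column left out. The sketch below does this with the numbers from the table above; "actionable" is just a label chosen here for "everything except Negligible", not a term used by any scanner.

```python
# Counts copied from the table above, in this severity order:
SEVERITIES = ("Critical", "High", "Medium", "Low", "Negligible")
CVES = {
    "Alpine (current)": (0, 14, 37, 6, 0),
    "Alpine (PyMuPDF)": (0, 13, 35, 6, 0),
    "Debian (Unstable)": (0, 3, 8, 6, 129),
    "Debian (Stable)": (1, 17, 25, 10, 132),
    "Debian (Stable, excluding won't fix)": (0, 4, 6, 0, 131),
}


def actionable(counts):
    """Sum every severity except the last one (Negligible)."""
    return sum(counts[:-1])


for image, counts in CVES.items():
    print(f"{image}: {actionable(counts)} of {sum(counts)} CVEs above Negligible")
```

With Negligible excluded, the totals are 57 and 54 for the two Alpine images, 17 for Debian Unstable, and 53 for Debian Stable (10 once `won't fix` entries are dropped).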

apyrgio commented 9 months ago

What is PyMuPDF's potential impact?

The following diagram shows how the integration of PyMuPDF opens the door for more improvements throughout the codebase, and how it solves some limitations.

[Diagram: PyMuPDF Impact (draw.io)]

(this file was created in https://draw.io, and can be edited there by uploading the above .png, since it has the diagram embedded in it. sweet...)

deeplow commented 9 months ago

Thanks for this investigation @apyrgio! The PyMuPDF + Debian stable-slim combination does seem really promising.

deeplow commented 9 months ago

Performance Impact of PyMuPDF

We stress-tested PyMuPDF on a large set of test documents and found that, overall, it did not degrade performance for most of them. In many cases it was actually faster, but it's hard to generalize since we don't have a real-world set of documents.

Other impacts of PyMuPDF

We summarized some of the results in this presentation

freedomofpress / dangerzone

PyMuPDF integration #658
