The official Tesseract site points to the Mannheim University Library for the latest Tesseract installers. These installers do not seem to be a solution on their own, because they are provided by a third party, and extracting a single Tesseract binary from them does not seem trivial.
The official Tesseract site suggests using either MacPorts or Homebrew. The problem with these approaches is that Dangerzone is a package that can be installed on its own, as a DMG. So, we can't expect that users will have Homebrew or MacPorts installed.
If we want to build Tesseract from source, we should keep in mind that we probably need to build a standalone binary (i.e., without shared libraries). There have been attempts to create standalone binaries, but they are not straightforward either.
We had a breakthrough on this front. It turns out that installing Tesseract-OCR is not a prerequisite for using the OCR capabilities of PyMuPDF. We were under the impression that it was, because PyMuPDF states for these functions that they "... Will fail if Tesseract is not installed". Another thing that threw us off is that other projects, like PyTesseract, do require a Tesseract installation.
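For reference, here's a minimal sketch of what that looks like in practice. It assumes a recent PyMuPDF wheel from PyPI (whose bundled MuPDF build includes Tesseract support) and a directory containing an `eng.traineddata` file; note that the `tessdata` argument only exists in newer PyMuPDF releases, while older ones read the `TESSDATA_PREFIX` environment variable instead:

```python
import fitz  # PyMuPDF; the PyPI wheels bundle Tesseract support inside MuPDF

doc = fitz.open("sample.pdf")     # hypothetical input document
pix = doc[0].get_pixmap(dpi=150)  # render the first page to pixels

# OCR the pixmap into a single-page, searchable PDF. No system-wide Tesseract
# binary is required; only the language data ("tessdata") must be available.
pdf_bytes = pix.pdfocr_tobytes(language="eng", tessdata="/path/to/tessdata")

with open("ocr-page.pdf", "wb") as f:
    f.write(pdf_bytes)
```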
- [ ] build the PDF as pages are created (we currently first convert everything and only then do the rest)
I added one more topic to the list.
Added some extra tasks:
- [ ] Download OCR language data under `share/` for inclusion on Windows / macOS, as well as when we do development.
- [ ] Add OCR language data for all languages as optional dependencies in our Linux packages (.deb/.rpm)
- [ ] Test on-host conversion on Windows, both on local builds and on our CI.
In this section, we'll explain how we plan to implement the on-host conversion feature.
As a reminder, here's the current conversion flow in pseudocode:
p = start_first_container()
p.stdin.write(document)
d = create_tmp_dir()
for (width, height, pixels) in p.stdout:
# Write width, height, and pixel data as individual files in tmp dir
p = start_second_container() # Also mount the temporary directory
p.wait()
# Move safe PDF from tmp dir to the target path
You can see that the container for the first phase starts and converts the pages to pixels. Then the second container starts, finds the pixels as mounted files, and produces a safe PDF.
We plan to change this flow as follows:
p = start_first_container()
safe_pdf = create_empty_pdf()
p.stdin.write(document)
for (width, height, pixels) in p.stdout:
pixmap = create_pymupdf_pixmap(width, height, pixels)
if ocr_lang:
page = ocr_page(pixmap, ocr_lang)
else:
page = convert_pixmap_to_pdf(pixmap)
safe_pdf.insert_pdf(page)
# Write safe PDF to the target path
In this case, the first container starts and remains running for the whole duration of the conversion. For every page that we get from the container's stdout, we convert it to a PyMuPDF pixmap. Then, we convert it to a single-page PDF, either by OCRing it, or by converting the pixels to PNG and then to PDF. Finally, we build the safe PDF by sequentially inserting each page into it.
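For illustration, here's a rough sketch of how that per-page loop could map onto PyMuPDF calls. It assumes the container emits raw RGB samples per page; `read_pages_from_stdout()` and `target_path` are placeholders rather than real Dangerzone code, and the exact calls may need adjusting per PyMuPDF version:

```python
import fitz  # PyMuPDF

def pixels_to_page_pdf(width, height, pixels, ocr_lang=None, tessdata_dir=None):
    """Turn one page's raw RGB samples into a single-page PDF (as bytes)."""
    pixmap = fitz.Pixmap(fitz.csRGB, width, height, pixels, False)
    if ocr_lang:
        # OCR path: PyMuPDF's bundled Tesseract only needs the language data.
        return pixmap.pdfocr_tobytes(compress=True, language=ocr_lang,
                                     tessdata=tessdata_dir)
    # Non-OCR path: serialize the pixels as PNG and wrap the image into a PDF.
    png_bytes = pixmap.tobytes("png")
    img_doc = fitz.open(stream=png_bytes, filetype="png")
    return img_doc.convert_to_pdf()

safe_pdf = fitz.open()  # empty PDF
for width, height, pixels in read_pages_from_stdout(p):  # hypothetical helper
    page_bytes = pixels_to_page_pdf(width, height, pixels, ocr_lang)
    safe_pdf.insert_pdf(fitz.open(stream=page_bytes, filetype="pdf"))
safe_pdf.save(target_path)
```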
While implementing the above solution on #748, we realized that of all platforms, we have a big problem in Debian-based ones specifically. Here's a list of issues we've encountered so far:
We have to sidestep this issue, and to do so, we thought of and evaluated several different approaches. We'll list below the ones we dropped, and explain why:
If PyMuPDF cannot solve everything for us across operating systems, then could Rust help here? We could create a Rust library, importable via PyO3, and built for our three OSes (Windows, macOS, Linux).
The functionality of the Rust library would be trivial: take the pixel data, convert them to PNG, compress them, and iteratively build a PDF from each page. That's super easy to do. However, the main problem is that we didn't find any Rust library that can OCR a PNG and produce a PDF without calling Tesseract under the hood.
Statically compiling Tesseract into our Rust library is a mandatory requirement; otherwise, it's too much of a maintenance hassle to use it on Windows / macOS systems which don't have Tesseract installed (see rationale in https://github.com/freedomofpress/dangerzone/issues/625#issuecomment-1827659843).
If the OS-provided PyMuPDF package is not enough for our needs, we could build our own from a version that works, and provide it from our repo. That's what we've done for the `conmon` package in the past (see https://github.com/freedomofpress/maint-dangerzone-conmon/tree/ubuntu/jammy).
The problem here is that unlike `conmon`, where we added a single patch to the package, the PyMuPDF API is not consistent across versions. We cannot build a single PyMuPDF package that is guaranteed to work for all applications in Ubuntu Focal and Debian Trixie.
Vendor PyMuPDF from PyPI in our package?
Another solution is to copy the contents of the PyMuPDF wheel from PyPI into a source dir of Dangerzone, solely for Debian-based distros, and use PyMuPDF from there. Unlike the Debian package route, this route will not affect other users, and we can maintain full control over this package.
The problem with this approach is that vendoring Python packages is usually against Debian's policy for inclusion in the official repos (think the `requests` package and the `urllib3` situation). Whatever solution we end up with should be one that allows us to be included in the Debian repos.
Use the `tesseract` CLI directly?
Another suggestion is to use the `tesseract` CLI directly just for Debian, and solely for the OCR conversion part. Then, we can use PyMuPDF for the rest of the actions.
The problem here is that, as we pointed out above, we've encountered segfaults with PyMuPDF < 1.22 in several Debian versions. So, a Tesseract-only solution does not cut it.
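For completeness, the dropped CLI route would have looked roughly like this (a sketch only; it assumes Pillow, or any other way to serialize the raw pixels to PNG, and the helper name is hypothetical):

```python
import subprocess
from PIL import Image  # assumption: Pillow (or anything else that writes PNGs)

def ocr_page_with_tesseract_cli(width, height, pixels, ocr_lang, workdir):
    """OCR one page's raw RGB samples into a searchable PDF via the tesseract CLI."""
    png_path = f"{workdir}/page.png"
    Image.frombytes("RGB", (width, height), pixels).save(png_path)
    # The trailing "pdf" is Tesseract's built-in config for searchable PDF output;
    # it writes <outputbase>.pdf, i.e. {workdir}/page.pdf in this case.
    subprocess.run(
        ["tesseract", png_path, f"{workdir}/page", "-l", ocr_lang, "pdf"],
        check=True,
    )
    return f"{workdir}/page.pdf"
```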
To the best of our knowledge, there's no PyMuPDF alternative with OCR capabilities that is available in all of our Debian-based platforms. The best solution we can come up with is:
The problems with the above combination are that:
Taking into account the above issues, the suggested approach is to :drum: :drum: :drum: run the second phase of the conversion in a container, just for Debian. Ok, I know this is counter-intuitive, but hear me out.
If we go all-in on on-host conversion, we lose our versatility in case we want to add a new feature that is not supported by every platform. We also lose our escape hatch, in case a dependency has a regression in any of the platforms we support.
Does that mean we backtrack though? Do we continue to mount volumes, and include Tesseract data in our container image? No, the idea is to change the way we run the container for the second phase. Here's what we can do:
- Keep the `dangerzone.conversion.pixels_to_pdf` Python module in the container image.
- Pass the following to the container:
  - `ocr_language_name`: `None` if not enabled, else the name of the language.
  - `ocr_language_data`: `None` if not enabled, else the Tesseract data for that language (in base64 form).
  - `pages`: The number of pages in the document.

This way, we have the following benefits:
The only caveat is that hosting the container image in an untrusted image registry is now off the table, since the image will be used in Debian as well. Note that this objective is already weakened by the fact that we spawn gVisor in it.
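To make this a bit more concrete, here's a rough sketch of how the host side might invoke that second-phase container. The parameter names come from the list above; treating them as environment variables, the image tag, and the stdin/stdout plumbing are all assumptions rather than a settled design:

```python
import base64
import subprocess

def run_pixels_to_pdf_container(pixels_stream, ocr_lang, tessdata_bytes, num_pages):
    """Feed page pixels to the containerized pixels-to-pdf phase; read back the safe PDF."""
    env_args = [
        "-e", f"ocr_language_name={ocr_lang or ''}",
        "-e", f"ocr_language_data={base64.b64encode(tessdata_bytes).decode() if tessdata_bytes else ''}",
        "-e", f"pages={num_pages}",
    ]
    # Assumed plumbing: pixels go in via stdin and the safe PDF comes out via
    # stdout, so no volume mounts are needed.
    proc = subprocess.run(
        ["podman", "run", "--rm", "-i", *env_args,
         "dangerzone.rocks/dangerzone",  # assumed image tag
         "python3", "-m", "dangerzone.conversion.pixels_to_pdf"],
        input=pixels_stream,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout  # the safe PDF, as bytes
```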
I am curious why this is just a Debian/Ubuntu issue and not Fedora? Is their PyMuPDF package set up differently/better?
Vendor PyMuPDF from PyPI in our package? ... The problem with this approach is that vendoring Python packages is usually against Debian's policy for inclusion in the official repos (think the `requests` package and the `urllib3` situation). Whatever solution we end up with should be one that allows us to be included in the Debian repos.
The section in Debian policy is https://www.debian.org/doc/debian-policy/ch-source.html#embedded-code-copies, which uses "should" to discourage vendoring but doesn't outright ban it. I think there needs to be a strong justification for it (✔️) and that we have a plan for handling any security/etc. updates that are needed for it, which I assume we'd be able to do if we went down this route.
Also, earlier you said:
the PyMuPDF API is not consistent across versions. We cannot build a single PyMuPDF package that is guaranteed to work for all applications in Ubuntu Focal and Debian Trixie.
Which I think is at odds with the goal of having official Debian packages. Because every new version wouldn't get pushed to all supported Debian releases, it would just go into unstable, with stable/oldstable staying the same. So you still only need to meet the latest version of the PyMuPDF API, and not all the way back until bullseye.
I am curious why this is just a Debian/Ubuntu issue and not Fedora? Is their PyMuPDF package set up differently/better?
Fedora enables the OCR feature in their PyMuPDF package, whereas Debian doesn't, which is an important distinction. Furthermore, the PyMuPDF packages for Fedora 39+ are > 1.22, and therefore we haven't encountered segfaults there. On the other hand, several Debian/Ubuntu releases we support offer PyMuPDF < 1.22, and we encounter segfaults there.
Maybe there's a patch that distros can backport, but I haven't managed to reach a conclusion about the nature of these errors.
The section in Debian policy is https://www.debian.org/doc/debian-policy/ch-source.html#embedded-code-copies, which uses "should" to discourage vendoring but doesn't outright ban it.
Thanks for the link, I didn't know the exact wording of this policy. I do see "should" there, but I also see some other policies that we'd need to circumvent:
But if you think that our use case can be an exception, then I'll seriously consider this option.
Because every new version wouldn't get pushed to all supported Debian releases, it would just go into unstable, with stable/oldstable staying the same. So you still only need to meet the latest version of the PyMuPDF API, and not all the way back until bullseye.
Yeah, I agree in principle. My argument though in that specific section was not about what would happen if the PyMuPDF API changes again in the future. I was talking about the current situation. We ship Dangerzone to the following Debian-based distros:
We cannot build a PyMuPDF package that is both > 1.22 (e.g., like the one in Debian Trixie), and can also be used without API incompatibilities in Ubuntu Focal. In any case relying on such a package is a no-no for inclusion in Debian, so I don't think we can go down that route.
Fedora enables the OCR feature in their PyMuPDF package, whereas Debian doesn't, which is an important distinction. Furthermore, the PyMuPDF packages for Fedora 39+ are > 1.22, and therefore we haven't encountered segfaults there. On the other hand, several Debian/Ubuntu releases we support offer PyMuPDF < 1.22, and we encounter segfaults there.
Getting bug reports for this would be good :)
But if you think that our use case can be an exception, then I'll seriously consider this option.
I wouldn't immediately count it out, I think it would need some level of input from the Debian security team and demonstration that the PyMuPDF package is deficient and unfixable. The former is clear and can be demonstrated via bug reports, but that it's unfixable seems up in the air.
We cannot build a PyMuPDF package that is both > 1.22 (e.g., like the one in Debian Trixie), and can also be used without API incompatibilities in Ubuntu Focal.
I think the two goals of "Support focal and trixie with identical packages" and "Get official packages into Debian" are at odds with each other and we/you probably need to compromise somewhere. Maybe something like using the package for trixie and for the older versions that won't end up in Debian proper, vendoring PyMuPDF.
Getting bug reports for this would be good :) ... I wouldn't immediately count it out, I think it would need some level of input from the Debian security team and demonstration that the PyMuPDF package is deficient and unfixable. The former is clear and can be demonstrated via bug reports, but that it's unfixable seems up in the air.
Gotcha, you're right. It's best to have a reproducible example for these segfaults. I didn't do so yet, because they are fixed in newer versions, but maybe Debian devs are interested in it.
I think the two goals of "Support focal and trixie with identical packages" and "Get official packages into Debian" are at odds with each other and we/you probably need to compromise somewhere. Maybe something like using the package for trixie and for the older versions that won't end up in Debian proper, vendoring PyMuPDF.
Oh wait, there's something important here that informs whatever decision we take on this subject. One of the "end goals" is to make Dangerzone eligible for landing on Debian Bookworm, and therefore Tails. So, I'm aiming for a proper Debian package not just for Debian Trixie, but for Bookworm as well. What's your opinion on the latter, do you think it's possible if we package PyMuPDF on our own?
Getting a package into bookworm at this point isn't possible at all. Once it's in trixie, we can add it to bookworm-backports, but I'm not sure if Tails has backports enabled.
Ok, it's good to know that. My understanding is that Tails can work with Bookworm backports (see onionshare issue), but it can also backport a Debian package into their repos, so I guess that's a viable route as well.
So, an alternative path forward is:
- Vendor PyMuPDF for the Debian-based distros where the distro-provided package doesn't meet our needs.
- Bump the PyMuPDF version in our `pyproject.toml` for two reasons: a) feature parity, b) fewer PyMuPDF versions to track for security updates.

This way, the PyMuPDF versions that we support are fairly recent, and their API should be consistent across all platforms. Plus, we won't have to use the container at all for the second stage of the conversion, although I must stress again that this may be a limiting factor in the long run.
I have just a few concerns:
Kunal, thanks a bunch for your input. If the above plan makes sense to you (from the perspective of a Debian dev), then I'll give it a shot.
Yep, overall that sounds good to me! Maybe explicitly "Vendor PyMuPDF for all Debian-based distros until the distro-provided package meets our needs"?
I'm not sure what the Ubuntu situation is, in case we want to be included there. In theory, Ubuntu will pick up the Dangerzone / PyMuPDF version in Trixie and use it for its subsequent releases, but I'm not super sure.
Exactly. Ubuntu will grab the latest version in testing and include it in the current development distro, and continue picking up changes until the Ubuntu release is frozen and shipped.
Kunal, thanks a bunch for your input. If the above plan makes sense to you (from the perspective of a Debian dev), then I'll give it a shot.
You're welcome :)
Maybe explicitly "Vendor PyMuPDF for all Debian-based distros until the distro-provided package meets our needs"?
Most def.
Regarding package vendoring, it's quite simple in the case of Dangerzone. What we need to do is:
1. Create a `requirements.txt` file for the Python package and its dependencies. For PyMuPDF, we already have a `tool.poetry.group.container.dependencies` section in our `pyproject.toml`, so we can create a requirements file with:

   poetry export --only container > requirements.txt
2. Create a `vendor/` directory under the `dangerzone` source dir, and install the packages in there:

   pip install -t dangerzone/vendor -r requirements.txt
3. Make Dangerzone favor the packages within the `vendor/` module, instead of the system ones:
diff --git a/dangerzone/__init__.py b/dangerzone/__init__.py
index f2a9961..339ca8d 100644
--- a/dangerzone/__init__.py
+++ b/dangerzone/__init__.py
@@ -1,6 +1,12 @@
import os
import sys
+try:
+ from . import vendor
+ sys.path.insert(0, os.path.dirname(vendor.__file__))
+except ImportError:
+ pass
+
if "DANGERZONE_MODE" in os.environ:
mode = os.environ["DANGERZONE_MODE"]
else:
That's it. From there on, we can do `import fitz` the regular way, and it will either pick up the vendored package (if it exists), or the system one.
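A quick way to sanity-check which copy gets picked up at runtime (purely for debugging, not part of the packaging changes):

```python
import fitz

# Points inside dangerzone/vendor/ when the vendored copy is used,
# and inside the system site-packages otherwise.
print(fitz.__file__)
```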
I sent a Debian bug report for the lack of OCR support in PyMuPDF: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1082023 (with a typo, ugh).
Background
Historically, in the container version of Dangerzone, the conversion happens in a second container. This was needed since Dangerzone relied on many Linux-native programs for conversion, such as GraphicsMagick and Ghostscript (for compression via `ps2pdf`). In Qubes, the conversion from pixels to PDF already happens on the host.
Suggestion
Allowing the second phase of the conversion to run on the host opens up several improvements for Dangerzone:
(partial capture, taken from https://github.com/freedomofpress/dangerzone/issues/658#issuecomment-1861366566)
Let's explain how the on-host conversion feature brings the above improvements:
Host container image independently
One of the reasons why the Dangerzone container image is not hosted in a container registry like Docker Hub is that we don't have a way to verify its contents during image pull. Normally, this would not be an issue for Dangerzone, since it already restricts the capabilities and privileges of the container. Therefore, we could treat a malicious image similarly to a container that has been pwned by a malicious document.
One problem here is that the same container image is used for the second stage of the conversion. A malicious container image in that stage would cause real damage, because its output is a PDF document, not pixels.
If we performed the pixels to PDF phase on the host instead, this would mean that the container image would be used solely for the first phase, and thus we could potentially push it to an untrusted image registry.
Benefit: This could open the door for updating the container image independently from the app, which would be a massive security improvement for Dangerzone.
Smaller container image
By performing the pixels-to-PDF phase on the host, we can remove the dependency on the OCR models from our image, which are currently used by PyMuPDF. These models take a lot of space (roughly 300 MiB), and removing them can slim down our image considerably.
As for where they will go, the plan is to move them to the host instead. This means we will either make them dependencies of the .deb/.rpm packages, or include them in the Windows/macOS installers.
Benefit: This would make our Debian/Fedora packages slimmer, which is important for getting included in the official repos, and in Tails (see https://github.com/freedomofpress/dangerzone/issues/669)
Dynamic loading of OCR models
If we move the OCR models to the host, we can then experiment with downloading them dynamically, when a user attempts to use a specific language.
Benefit: This results in size improvements, since we would further slim down the Windows/macOS installers, and the packages the users need to download in Linux.
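As a rough illustration of what dynamic loading could look like, here's a sketch that fetches a missing language model on first use. The `tessdata_fast` GitHub repository and its raw-file URL layout are assumptions about where the data would come from, and a real implementation would also need to verify the integrity of what it downloads:

```python
import pathlib
import urllib.request

# Assumed source: one <lang>.traineddata file per language in tessdata_fast.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main/{lang}.traineddata"

def ensure_ocr_model(lang: str, share_dir: pathlib.Path) -> pathlib.Path:
    """Download the OCR model for `lang` on first use; reuse the cached copy afterwards."""
    tessdata_dir = share_dir / "tessdata"
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    model = tessdata_dir / f"{lang}.traineddata"
    if not model.exists():
        urllib.request.urlretrieve(TESSDATA_URL.format(lang=lang), model)
    return model
```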
Besides the above benefits, this feature leaves room for some extra improvements, namely:
Tasks
- [ ] Download OCR language data under `share/` for inclusion on Windows / macOS, as well as when we do development (see #748).