freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.71k stars 172 forks source link

On-host pixels to PDF conversion #625

Closed deeplow closed 4 weeks ago

deeplow commented 11 months ago

Background

Historically on the containers version of Dangerzone the conversion happens on a second container. This was needed since Dangerzone relied on many linux-native programs for conversion such as GraphicsMagic, ghostscript (for compression via ps2pdf). In Qubes the conversion from pixels to PDF already happens on the host.

Suggestion

Allowing the second phase of the conversion to run in the host opens up several improvements for Dangerzone:

image

(partial capture, taken from https://github.com/freedomofpress/dangerzone/issues/658#issuecomment-1861366566)

Let's explain how the on-host conversion feature brings the above improvements:

Host container image independently

One of the reasons why the Dangerzone container image is not hosted in a container registry like DockerHub is because we don't have a way to verify its contents during image pull. Normally, this would not be an issue for Dangerzone, since it already restricts the capabilities and privileges of the container. Therefore, we could treat a malicious image similarly as a container that has been pwned by a malicious document.

One problem here is that the same container image is used for the second stage of the conversion. A malicious container image in that stage would cause real damage, because its output is a PDF document, not pixels.

If we performed the pixels to PDF phase on the host instead, this would mean that the container image would be used solely for the first phase, and thus we could potentially push it to an untrusted image registry.

Benefit: This could open the door for updating the container image independently from the app, which would be a massive security improvement for Dangerzone.

[!NOTE] That was the original thinking at least. Since creating the above diagram, we now run gVisor within the container image. This means that a malicious container registry can serve an image that does not contain gVisor, and significantly weaken our defenses. The resulting container will still be restricted by our security flags, but the truth is that the original point is no longer that strong.

Smaller container image

By performing the pixels to PDf phase on host, we can remove the dependency to the OCR models from our image, which are currently used by PyMuPDF. These models take a lot of space (roughly ~300MiB), and removing them can slim down our image considerably.

As for where they will go, the plan is to move them to the host instead. This means will either make them dependencies of the .deb/.rpm packages, or include them in the Windows/macOS installers.

Benefit: This would make our Debian/Fedora packages slimmer, which is important for getting included in the official repos, and in Tails (see https://github.com/freedomofpress/dangerzone/issues/669)

Dynamic loading of OCR models

If we move the OCR models to the host, we can then experiment with downloading them dynamically, when a user attempts to use a specific language.

Benefit: This results in size improvements, since we would further slim down the Windows/macOS installers, and the packages the users need to download in Linux.


Besides the above benefits, this feature opens the room for some extra improvements, namely:

Tasks

apyrgio commented 11 months ago

On Packaging

Windows

The official Tesseract site points at the Mannheim University Library for the latest installers on Tesseract. These installers do not seem to be a solution on their own, because they are provided a by third-party, and extracting a single Tesseract binary from them seems not trivial.

MacOS

The official Tesseract site suggests using either MacPorts or HomeBrew. The problem with these approaches is that Dangerzone is a package that can be installed on its own, as a DMG. So, we can't expect that users will have HomeBrew or MacPorts enabled.

Building from Source

If we want to build Tesseract from source, we should have in mind that we probably need to build a standalone binary (i.e., without shared libraries). There have been attempts to create standalone binaries, but they are not straight-forward as well.

apyrgio commented 11 months ago

We had a breakthrough in this front. Turns out that installing Tesseract-OCR is not a prerequisite for using the OCR capabilities of PyMuPDF. We were mistaken that this was the case because PyMuPDF specified for these functions that "... Will fail if Tesseract is not installed". Also, one more thing that threw us off is that other projects like PyTesseract require a Tesseract installation.

deeplow commented 10 months ago
  • [ ] build the PDF as pages are created (we currently first convert everything and only then do the rest)

I added one more topic to the list.

apyrgio commented 10 months ago

Added some extra tasks:

  • [ ] Download OCR language data under share/ for inclusion on Windows / MacOS, as well as when we do development.
  • [ ] Add OCR language data for all languages as optional dependencies in our Linux packages (.deb/.rpm)
  • [ ] Test on-host conversion on Windows, both on local builds and on our CI.
apyrgio commented 2 months ago

Design overview

In this section, we'll explain how we plan to implement the on-host conversion feature.

As a reminder, here's in pseudocode the current conversion flow:

p = start_first_container()
p.stdin.write(document)
d = create_tmp_dir()
for (width, height, pixels) in p.stdout:
    # Write width, height, and pixel data as individual files in tmp dir

p = start_second_container()  # Also mount the temporary directory
p.wait()
# Move safe PDF from tmp dir to the target path

You can see that the container for the first phase starts, and converts the pages to pixels. Then the second container starts, find the pixels as mounted files, and produces a safe PDF

We plan to change this flow as follows:

p = start_first_container()
safe_pdf = create_empty_pdf()
p.stdin.write(document)
for (width, height, pixels) in p.stdout:
    pixmap = create_pymupdf_pixmap(width, height, pixels)
    if ocr_lang:
        page = ocr_page(pixmap, ocr_lang)
    else:
       page = convert_pixmap_to_pdf(pixmap)
    safe_pdf.insert_pdf(page)
# Write safe PDF to the target path

In this case, the first container starts, and remains running for the whole duration of the conversion. For every page that we get from the container's stdout, we convert it to a PyMuPDF pixmap. Then, we convert it to a single-page PDF, either by OCRing it, or by converting the pixels to PNG and then to PDF. Finally, we build a safe PDF by inserting sequentially each page in it.

apyrgio commented 2 months ago

The Debian situation

While implementing the above solution on #748, we realized that of all platforms, we have a big problem in Debian-based ones specifically. Here's a list of issues we've encountered so far:

  1. Ubuntu Focal / Debian Bullseye offer PyMuPDF versions (1.16.11 and 1.17.4) which do not have OCR capabilities. These have been added in 1.19.0 (see "Changes in Version 1.19.0" section in https://pymupdf.readthedocs.io/en/latest/changes.html).
  2. We have experienced various segfaults in PyMuPDF versions < 1.22.
  3. Debian and Ubuntu do not actually enable the OCR feature in any PyMuPDF package (!), and there's no way to detect this unless we attempt a conversion.
  4. The PyMuPDF API changes significantly between versions, to the point where almost every function we use in some Debian-based distro may not exist in another one.

Approaches we considered

We have to sidestep this issue, and to do so, we thought of and evaluated several different approaches. We'll list below the ones we dropped, and explain why:

In Rust we trust?

If PyMuPDF cannot solve everything for us across operating systems, then could Rust help here? We could create a Rust library, importable via PyO3, and built for our three OSes (Windows, macOS, Linux).

The functionality of the Rust library would be very trivial. Just take the pixel data, convert them to PNG, compress them, and iteratively build a PDF from each page. That's super easy to do. However, the main problem here is that we didn't find any Rust library that can OCR a PNG and produce a PDF, without calling Tesseract under the hood.

Statically compiling Tesseract in our Rust library is a mandatory requirement, else it's too much of a maintainance hassle to use it on Windows / macOS systems which don't have Tesseract installed (see rationale in https://github.com/freedomofpress/dangerzone/issues/625#issuecomment-1827659843).

Provide our own PyMuPDF package?

If the OS-provided PyMuPDF package is not enough for our needs, we could build our own from a version that works, and provide it from our repo. That's what we've done for the conmon package in the past (see https://github.com/freedomofpress/maint-dangerzone-conmon/tree/ubuntu/jammy)

The problem here is that unlike conmon, where we added a single patch to the package, the PyMuPDF API is not consistent across versions. We cannot build a single PyMuPDF package that is guaranteed to work for all applications in Ubuntu Focal and Debian Trixie.

Vendor PyMuPDF from PyPI in our package?

Another solution is to copy the contents of the PyMuPDF wheel from PyPI into a source dir of Dangerzone, solely for Debian-based distros, and use PyMuPDF from there. Unlike the Debian package route, this route will not affect other users, and we can maintain full control over this package.

The problem with this approach is that vendoring Python packages is usually against Debian's policy for inclusion in the official repos (think the requests package and the urllib3 situation). Whatever solution we end up with should be one that allows us to be included in the Debian repos.

Use tesseract CLI directly?

Another suggestion is to use tesseract CLI directly just for Debian, and solely for the OCR conversion part. Then, we can use PyMuPDF for the rest of the actions.

The problem here is that, as we pointed out above, we've encountered segfaults with PyMuPDF < 1.22 in several Debian versions. So, a Tesseract-only solution does not cut it.

Use a PyMuPDF alternative?

To the best of our knowledge, there's no PyMuPDf alternative with OCR capabilities that is available in all of our Debian-based platforms. The best solution we can come up with is:

The problems with the above combination are that:

Suggested approach

Taking into account the above issues, the suggested approach is to :drum: :drum: :drum: ; run the second phase of the conversion in a container, just for Debian. Ok, I know this is counter-intuitive, but hear me out.

If we go all-in on on-host conversion, we lose our versatility in case we want to add a new feature that is not supported by every platform. We also lose our escape hatch, in case a dependency has a regression in any of the platforms we support.

Does that mean we backtrack though? Do we continue to mount volumes, and include Tesseract data in our container image? No, the idea is to change the way we run the container for the second phase. Here's what we can do:

This way, we have the following benefits:

The only caveat is that hosting the container image in an untrusted image registry is now prohibitive, since it will be used in Debian as well. Note that this objective is already weakened from the fact that we spawn gVisor in it.

legoktm commented 2 months ago

I am curious why this is just a Debian/Ubuntu issue and not Fedora? Is their PyMuPDF package setup differently/better?

Vendor PyMuPDF from PyPI in our package? ... The problem with this approach is that vendoring Python packages is usually against Debian's policy for inclusion in the official repos (think the requests package and the urllib3 situation). Whatever solution we end up with should be one that allows us to be included in the Debian repos.

The section in Debian policy is https://www.debian.org/doc/debian-policy/ch-source.html#embedded-code-copies, which uses "should" to discourage vendoring but doesn't outright ban it. I think there needs to be a strong justification for it (✔️) and that we have a plan for handling any security/etc. updates that are needed for it, which I assume we'd be able to do if we went down this route.

Also, earlier you said:

the PyMuPDF API is not consistent across versions. We cannot build a single PyMuPDF package that is guaranteed to work for all applications in Ubuntu Focal and Debian Trixie.

Which I think is at odds with the goal of having official Debian packages. Because every new version wouldn't get pushed to all supported Debian releases, it would just go into unstable, with stable/oldstable staying the same. So you still only need to meet the latest version of the PyMuPDF API, and not all the way back until bullseye.

apyrgio commented 2 months ago

I am curious why this is just a Debian/Ubuntu issue and not Fedora? Is their PyMuPDF package setup differently/better?

Fedora enables the OCR feature in their PyMuPDF package, whereas Debian doesn't, which is an important distinction. Furthermore, the PyMuPDF packages for Fedora 39+ are > 1.22, and therefore we haven't encountered segfaults there. On the other hand, several Debian/Ubuntu releases we support offer PyMuPDF < 1.22, and we encounter segfaults there.

Maybe there's a patch that distros can backport, but I haven't managed to reach to a conclusion about the nature of these errors.

The section in Debian policy is https://www.debian.org/doc/debian-policy/ch-source.html#embedded-code-copies, which uses "should" to discourage vendoring but doesn't outright ban it.

Thanks for the link, I didn't know the exact wording for this policy. I do see "should" there, but I see some other policies as well that we need to circumvent as well:

But if you think that our use case can be an exception, then I'll seriously consider this option.

Because every new version wouldn't get pushed to all supported Debian releases, it would just go into unstable, with stable/oldstable staying the same. So you still only need to meet the latest version of the PyMuPDF API, and not all the way back until bullseye.

Yeah, I agree in principle. My argument though in that specific section was not about what would happen if the PyMuPDF API changes again in the future. I was talking about the current situation. We ship Dangerzone to the following Debian-based distros:

We cannot build a PyMuPDF package that is both > 1.22 (e.g., like the one in Debian Trixie), and can also be used without API incompatibilities in Ubuntu Focal. In any case relying on such a package is a no-no for inclusion in Debian, so I don't think we can go down that route.

legoktm commented 2 months ago

Fedora enables the OCR feature in their PyMuPDF package, whereas Debian doesn't, which is an important distinction. Furthermore, the PyMuPDF packages for Fedora 39+ are > 1.22, and therefore we haven't encountered segfaults there. On the other hand, several Debian/Ubuntu releases we support offer PyMuPDF < 1.22, and we encounter segfaults there.

Getting bug reports for this would be good :)

But if you think that our use case can be an exception, then I'll seriously consider this option.

I wouldn't immediately count it out, I think it would need some level of input from the Debian security team and demonstration that the PyMuPDF package is deficient and unfixable. The former is clear and can be demonstrated via bug reports, but that it's unfixable seems up in the air.

We cannot build a PyMuPDF package that is both > 1.22 (e.g., like the one in Debian Trixie), and can also be used without API incompatibilities in Ubuntu Focal.

I think the two goals of "Support focal and trixie with identical packages" and "Get official packages into Debian" are at odds with each other and we/you probably need to compromise somewhere. Maybe something like using the package for trixie and for the older versions that won't end up in Debian proper, vendoring PyMuPDF.

apyrgio commented 2 months ago

Getting bug reports for this would be good :) ... I wouldn't immediately count it out, I think it would need some level of input from the Debian security team and demonstration that the PyMuPDF package is deficient and unfixable. The former is clear and can be demonstrated via bug reports, but that it's unfixable seems up in the air.

Gotcha, you're right. It's best to have a reproducible example for these segfaults. I didn't do so yet, because they are fixed in newer versions, but maybe Debian devs are interested in it.

I think the two goals of "Support focal and trixie with identical packages" and "Get official packages into Debian" are at odds with each other and we/you probably need to compromise somewhere. Maybe something like using the package for trixie and for the older versions that won't end up in Debian proper, vendoring PyMuPDF.

Oh wait, there's something important here that informs whatever decision we take on this subject. One of the "end goals" is to make Dangerzone eligible for landing on Debian Bookworm, and therefore Tails. So, I'm aiming for a proper Debian package not just for Debian Trixie, but for Bookworm as well. What's your opinion on the latter, do you think it's possible if we package PyMuPDF on our own?

legoktm commented 2 months ago

Getting a package into bookworm at this point isn't possible at all. Once it's in trixie, we can add it to bookworm-backports, but I'm not sure if Tails has backports enabled.

apyrgio commented 2 months ago

Ok, it's good to know that. My understanding is that Tails can work with Bookworm backports (see onionshare issue), but that it can also backport a Debian package into their repos, so I guess that's a viable route as well.

So, we an alternative path forward is:

  1. Vendor PyMuPDF for all Debian-based distros.
    • We should probably vendor the version used in our Windows/macOS installations, i.e., the one in our pyproject.toml for two reasons; a) feature-parity, b) less PyMuPDF versions to track for security updates.
  2. Create a Debian bug report for PyMuPDF, and mention that the OCR feature is disabled.
    • If the Debian maintainer can enable it, we can expect a new package to land in Debian Trixie.
    • If they can't enable it, then find an alternative; either rebuild the Debian Trixie package with the OCR compile flag on, or use another way for OCR
  3. Create a Debian bug report for the segfaults we've encountered in versions < 1.22.

This way, the PyMuPDF versions that we support are fairly recent, and their API should be consistent across all platforms. Plus, we won't have to use the container at all for the second stage of the conversion, although I must stress again that this may be a limiting factor in the long run.

I have just a few concerns:

  1. I'm a bit concerned about the maintainability of the vendored PyMuPDF code. Ultimately though, we will update it as we update PyMuPDF on Windows/macOS, so security-wise it's no different.
  2. I'm not sure what's the Ubuntu situation, in case we want to be included there. In theory, Ubuntu will pick up the Dangerzone / PyMuPDF version in Trixie and use it for its subsequent releases, but I'm not super sure.

Kunal, thanks a bunch for your input. If the above plan makes sense to you (from the perspective of a Debian dev), then I'll give it a shot.

legoktm commented 2 months ago

Yep, overall that sounds good to me! Maybe explicitly "Vendor PyMuPDF for all Debian-based distros until the distro-provided package meets our needs"?

I'm not sure what's the Ubuntu situation, in case we want to be included there. In theory, Ubuntu will pick up the Dangerzone / PyMuPDF version in Trixie and use it for its subsequent releases, but I'm not super sure.

Exactly. Ubuntu will grab the latest version in testing and include it in the current development distro, and continue picking up changes until the Ubuntu release is frozen and shipped.

Kunal, thanks a bunch for your input. If the above plan makes sense to you (from the perspective of a Debian dev), then I'll give it a shot.

You're welcome :)

apyrgio commented 2 months ago

Maybe explicitly "Vendor PyMuPDF for all Debian-based distros until the distro-provided package meets our needs"?

Most def.


Regarding package vendoring, it's quite simple in the case of Dangerzone. What we need to do is:

  1. Create a requirements.txt file for the Python package and its dependencies. For PyMuPDF, we already have a tool.poetry.group.container.dependencies section in our pyproject.toml, so we can create a requirements file with:

    poetry export --only container > requirements.txt
  2. Create a vendor/ directory under the dangerzone source dir, and install the packages in there:

     pip install -t dangerzone/vendor -r requirements.txt
  3. Make Dangerzone favor the packages within the vendor/ module, instead of the system ones:

      diff --git a/dangerzone/__init__.py b/dangerzone/__init__.py
      index f2a9961..339ca8d 100644
      --- a/dangerzone/__init__.py
      +++ b/dangerzone/__init__.py
      @@ -1,6 +1,12 @@
       import os
       import sys
    
      +try:
      +    from . import vendor
      +    sys.path.insert(0, os.path.dirname(vendor.__file__))
      +except ImportError:
      +    pass
      +
       if "DANGERZONE_MODE" in os.environ:
           mode = os.environ["DANGERZONE_MODE"]
       else:

That's it. From there on, we can do import fitz the regular way, and it will either pick up the vendored package (if it exists), or the system one.

apyrgio commented 1 month ago

I sent a Debian bug report for the lack of OCR support in PyMuPDF: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1082023 (with a typo, ugh).