freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0

(client) Handle RGB pages not fitting in temporary directories #526

Open apyrgio opened 11 months ago

apyrgio commented 11 months ago

(this issue is a follow-up to https://github.com/freedomofpress/dangerzone/issues/518, best done after #443)

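The size of a single A4 page (8.27 in × 11.69 in) in pixels is:

$$ (8.27 \times \text{DPI}) \times (11.69 \times \text{DPI}) \approx 96.68 \times \text{DPI}^2 $$

We also need to account for the 3 color channels (RGB), at one byte per channel, meaning that the final size in bytes is:

$$ 3 \times 96.68 \times \text{DPI}^2 \approx 290 \times \text{DPI}^2 $$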

If we have 1 GiB of RAM available, it takes roughly 716 pages (72 DPI) or 165 pages (150 DPI) to fill it up. pdftoppm uses 150 DPI by default for the conversion to PPM, meaning that users with limited RAM (e.g., 1 GiB) will not be able to convert PDFs with more than about 165 pages. Note that this is the case because we store the RGB files in a temporary directory as a result of the conversion to pixels, and on many Linux systems that directory is a RAM-backed tmpfs.
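As a rough check of the arithmetic above, here is a minimal sketch that estimates the uncompressed size of an A4 RGB page and how many such pages fill a given capacity. The A4 dimensions in inches (8.27 × 11.69) and the one-byte-per-channel layout are assumptions of this sketch, and the helper names are hypothetical. With these rounded dimensions the results land within a page or two of the figures quoted above.

```python
import math

A4_WIDTH_IN = 8.27    # assumed A4 width in inches (210 mm)
A4_HEIGHT_IN = 11.69  # assumed A4 height in inches (297 mm)
RGB_CHANNELS = 3      # assumed one byte per channel
GIB = 1024 ** 3


def page_bytes(dpi: int) -> float:
    """Approximate bytes for one uncompressed A4 RGB page at the given DPI."""
    return A4_WIDTH_IN * dpi * A4_HEIGHT_IN * dpi * RGB_CHANNELS


def pages_to_fill(capacity_bytes: int, dpi: int) -> int:
    """Number of pages at which capacity_bytes is exhausted."""
    return math.ceil(capacity_bytes / page_bytes(dpi))


for dpi in (72, 150):
    print(f"{dpi} DPI: ~{page_bytes(dpi) / 1e6:.1f} MB per page, "
          f"{pages_to_fill(GIB, dpi)} pages to fill 1 GiB")
```

Since the per-page size grows with the square of the DPI, doubling the resolution roughly quarters the number of pages that fit.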

This is a limitation that does not affect all users or files, but we need to find a solution for it.

apyrgio commented 11 months ago

Possible solutions

  1. Can we pipe the pages from container 1 to container 2, and call the program that "unites" them with the stdin as an argument?

    This would essentially remove any intermediate step and greatly reduce the bytes we'd need to store on a temp dir. However, this requires two things that we don't have right now:

    • An architecture where we spawn 2 containers that speak to each other.
    • A program that reads pages from stdin, instead of a filesystem.
  2. Can we compress each page that we receive from the first container?

    Yes, we saw that compressing an RGB page into PNG leads to 30x - 40x size reduction for typical document types (letters on white background), and only 2x size reduction for photos. This is probably fine, as we expect most multi-page documents to not be dominated by photos.

    Note that this means that the 1st container must not save pages in a temporary filesystem, but stream them instead to the host, and the host must immediately convert them to PNG, e.g., using python-pil. We already have an open issue for this: https://github.com/freedomofpress/dangerzone/issues/443

  3. Can we store the pages in a data dir?

    Previously, we stored intermediate pages in the user's config dir. This brought some issues of its own (see https://github.com/freedomofpress/dangerzone/issues/317), but also undermined the confidentiality of these documents, as traces of them could remain on the user's computer. Consider the case where the original files are on an encrypted device or in a tmp dir. Therefore, this is something that we can't do.

From the above solutions, it seems that (2) is the one we should go with.
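To illustrate option (2), here is a minimal sketch of turning one raw RGB page into a PNG on the host. The thread proposes Pillow for this; the sketch below swaps in a tiny standard-library PNG encoder (`zlib` + `struct`) purely so that it runs without extra dependencies. The function names and the synthetic test page are hypothetical, and the synthetic page is far more regular than a real document, so the ratio it prints should not be read as representative.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def rgb_to_png(rgb: bytes, width: int, height: int) -> bytes:
    """Encode raw 8-bit RGB pixels as a PNG, using filter type 0 on every row."""
    stride = width * 3
    if len(rgb) != stride * height:
        raise ValueError("pixel buffer does not match the given dimensions")
    # Every scanline is prefixed with a filter-type byte (0 = no filtering).
    rows = b"".join(
        b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, color type 2 (truecolor),
    # compression 0, filter 0, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows, 6))
        + _chunk(b"IEND", b"")
    )


# Synthetic "document": a white page with evenly spaced black bars.
width, height = 1240, 1754  # roughly A4 at 150 DPI
page = bytearray(b"\xff" * (width * height * 3))
for y in range(100, height - 100, 40):
    start = (y * width + 100) * 3
    page[start:start + 1000 * 3] = b"\x00" * (1000 * 3)

png = rgb_to_png(bytes(page), width, height)
print(f"raw: {len(page)} bytes, png: {len(png)} bytes")
```

With this approach the host only ever holds one uncompressed page at a time, as long as each page is encoded as soon as it arrives and the raw buffer is dropped afterwards.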

Workarounds

If you are affected by this problem and are on Linux, you can consider the following workarounds:

  1. You can increase the size of /tmp through /etc/fstab. See https://www.looklinux.com/how-to-resize-tmpfs-on-linux/
  2. You can specify a different temp dir using the TEMP environment variable (e.g., TEMP=/home/tmp dangerzone)
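To check whether a workaround actually gives enough room, here is a minimal sketch that reports which temporary directory would be picked and how many uncompressed pages fit in its free space. It assumes the temp dir is resolved the way Python's `tempfile` module does it (which checks `TMPDIR`, then `TEMP`, then `TMP`); whether Dangerzone resolves it exactly this way is an assumption. The helper name `temp_dir_capacity` is hypothetical.

```python
import shutil
import tempfile


def temp_dir_capacity(bytes_per_page: int) -> tuple[str, int]:
    """Return the temp dir Python would use and how many pages fit in its free space."""
    # Drop the cached value so the environment variables are read again.
    tempfile.tempdir = None
    tmp = tempfile.gettempdir()
    free = shutil.disk_usage(tmp).free
    return tmp, free // bytes_per_page


# Roughly one A4 page at 150 DPI, 3 bytes per pixel.
tmp, pages = temp_dir_capacity(1240 * 1754 * 3)
print(f"{tmp}: room for about {pages} uncompressed pages")
```

Note that under this assumption a `TMPDIR` that is already set takes precedence over `TEMP`.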

apyrgio commented 11 months ago

We also need to find out if this affects Windows / macOS platforms, i.e., if tmpfs (or an equivalent RAM-backed temp dir) is used there.

deeplow commented 9 months ago

> From the above solutions, it seems that (2) is the one we should go with.

I agree with this. We can use the Pillow Python module to convert from RGB to PNG (or, if needed, even to PDF).