Open apyrgio opened 11 months ago
Can we pipe the pages from container 1 to container 2, and call the program that "unites" them with the stdin as an argument?
This would essentially remove any intermediate step and greatly reduce the number of bytes we'd need to store in a temp dir. However, this requires two things that we don't have right now:
Can we compress each page that we receive from the first container?
Yes, we saw that compressing an RGB page into PNG leads to 30x - 40x size reduction for typical document types (letters on white background), and only 2x size reduction for photos. This is probably fine, as we expect most multi-page documents to not be dominated by photos.
Note that this means that the 1st container must not save pages in a temporary filesystem, but stream them instead to the host, and the host must immediately convert them to PNG, e.g., using python-pil. We already have an open issue for this: https://github.com/freedomofpress/dangerzone/issues/443
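The streaming approach described above could be sketched roughly as follows. This is a minimal illustration, not Dangerzone's actual code: read_pages and compress_page are hypothetical helpers, the page dimensions are assumed to be known up front, and zlib (the DEFLATE compression that PNG also uses) stands in for a real PNG encoder so that the sketch only needs the standard library.

```python
import io
import zlib
from typing import BinaryIO, Iterator


def read_pages(stream: BinaryIO, page_size: int) -> Iterator[bytes]:
    """Yield fixed-size raw RGB pages from a stream, one at a time.

    Assumes a buffered stream, where read(n) returns fewer than n bytes
    only when the end of the stream has been reached.
    """
    while True:
        page = stream.read(page_size)
        if not page:
            return  # clean end of stream
        if len(page) != page_size:
            raise ValueError("stream ended in the middle of a page")
        yield page


def compress_page(page: bytes) -> bytes:
    """Compress one page in memory (stand-in for encoding it as PNG)."""
    return zlib.compress(page)


# Hypothetical input: two tiny all-white "pages" of 100 x 100 RGB pixels.
width, height = 100, 100
page_size = width * height * 3
stream = io.BytesIO(b"\xff" * page_size * 2)

# Only one uncompressed page is held in memory at any time.
compressed = [compress_page(p) for p in read_pages(stream, page_size)]
```

With this shape, the host never needs the uncompressed pages on disk; it only keeps the compressed result of each page.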
Can we store the pages in a data dir?
Previously, the way we stored intermediate pages was in the config dir of the user. This brought some issues of its own (see https://github.com/freedomofpress/dangerzone/issues/317), but also undermined the confidentiality of these documents, as traces of them could remain in the user's computer. Consider the case where the original files are in an encrypted device or tmp dir. Therefore, this is something that we can't do.
From the above solutions, it seems that (2) is the one we should go with.
If you are a user that has this problem and you are on Linux, you can consider the following workarounds:
- Resize /tmp through /etc/fstab. See https://www.looklinux.com/how-to-resize-tmpfs-on-linux/
- Set the TEMP environment variable (e.g., TEMP=/home/tmp dangerzone)

We also need to find out if this affects Windows / macOS platforms, i.e., if tmpfs is used there.
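To check whether a given machine is likely to hit this limit, one could look at the free space in the temporary directory. The sketch below assumes that the temporary directory is chosen the way Python's tempfile module does it, and that each page is an uncompressed A4 RGB image at 150 DPI; PAGE_BYTES and pages_that_fit are illustrative names.

```python
import shutil
import tempfile

# Assumption: one uncompressed A4 page at 150 DPI, RGB (1240 x 1754 x 3 bytes).
PAGE_BYTES = 1240 * 1754 * 3

# tempfile consults the TMPDIR, TEMP and TMP environment variables, in that order.
tmp_dir = tempfile.gettempdir()
free_bytes = shutil.disk_usage(tmp_dir).free
pages_that_fit = free_bytes // PAGE_BYTES

print(f"{tmp_dir}: {free_bytes} bytes free, room for about {pages_that_fit} pages")
```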
(this issue is a follow up of https://github.com/freedomofpress/dangerzone/issues/518, best done after #443)
The size of a single A4 page in pixels is:
We also need to account for 3 color channels (RGB), meaning that the final size in bytes is:
If we have 1 GiB of RAM available, we need 716 pages (72 DPI) or 165 pages (150 DPI) to fill it up. It seems that pdftoppm does use 150 DPI by default for the conversion to PPM, meaning that users with limited RAM (e.g., 1 GiB) will not be able to convert PDFs with more than 165 pages. Note that this is the case because we store the RGB files in a temporary directory as a result of the conversion to pixels.

This is a limitation that does not affect all users or files, but we need to find a solution for it.
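The estimate above can be reproduced with a short calculation. This sketch assumes an A4 page of 210 x 297 mm, one byte per color channel, and pixel dimensions rounded to the nearest integer; page_bytes and pages_per_gib are illustrative names. The results land within a couple of pages of the figures quoted above, with the small difference coming down to rounding.

```python
A4_MM = (210, 297)  # ISO 216 A4 size in millimetres
MM_PER_INCH = 25.4
CHANNELS = 3  # RGB, one byte per channel
GIB = 2**30


def page_bytes(dpi: int) -> int:
    """Size in bytes of one uncompressed A4 RGB page at the given DPI."""
    width_px = round(A4_MM[0] / MM_PER_INCH * dpi)
    height_px = round(A4_MM[1] / MM_PER_INCH * dpi)
    return width_px * height_px * CHANNELS


def pages_per_gib(dpi: int) -> int:
    """Number of whole uncompressed pages that fit in 1 GiB."""
    return GIB // page_bytes(dpi)


for dpi in (72, 150):
    print(dpi, page_bytes(dpi), pages_per_gib(dpi))
# 72 DPI:  595 x 842 px   -> 1502970 bytes per page, 714 whole pages per GiB
# 150 DPI: 1240 x 1754 px -> 6524880 bytes per page, 164 whole pages per GiB
```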