freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0

Dangerzone is massively inflating file sizes by default - am I missing something? #970

Closed TechReverie closed 3 weeks ago

TechReverie commented 4 weeks ago

What happened?

When I run a pdf file through Dangerzone the output file is huge compared to the original - for example 4MB --> 20MB, 50MB --> 282MB, 85MB --> 287MB.

I was under the impression that as part of the conversion the files were compressed. Did I get that wrong?

Linux distribution

Linuxmint 21.3 or Fedora 40 - both result in the same inflated results, and both inflate the files to the same file size

Dangerzone version

0.7.1

Podman info

No response

Document conversion logs

No response

Additional info

No response

almet commented 4 weeks ago

Hey, thanks for opening a bug report, that certainly seems suspicious. We've already seen this, though I believe not to the extent you're reporting now, and we have an issue for it here: https://github.com/freedomofpress/dangerzone/issues/239

If some of the PDFs that lead to these size changes are shareable, could you send them to us at alexis@freedom.press (or attach them here if you feel like it)?

TechReverie commented 4 weeks ago

Hi, thanks for getting back to me so quickly.

Due to copyright, I'm not sure I can share the exact files I tried this with, but they can be downloaded straight from the publisher here: https://magpi.raspberrypi.com/issues, if that helps.

The attached image is a directory listing showing the before/after sizes of converting issues 136, 137, and 146. [Image: comparing sizes of dangerzone converted files]

I've had a further play with some of my old archived instruction manuals, which show differing results, so I wonder if some particular PDF format may be tripping up the software? It's beyond my skills to know the difference between these, so I've attached the originals for your perusal if that helps.

Note from the directory image that the freenas guide massively inflated, whereas the HD20 and P9657AA manuals reduced in size as expected/hoped.

freenas9.2.1_guide.pdf

P9657AA-Manual-EN-v1.0-090406.pdf HD20-M-en-GB.pdf

I cannot upload the 'safe' version of the converted freenas guide as it's over the upload file size limit.

If you need the converted versions of the other two, I can upload those, and if I can assist further, do let me know.

Thank you.

apyrgio commented 3 weeks ago

Thanks for the link to the documents! I did a quick check and I can reproduce the size inflation you're noticing. However, I'm afraid it's kind of an expected side-effect of the way Dangerzone converts documents. The original file size does not affect the final file size, but the number of pages does.

You see, Dangerzone first renders each document page to pixels (RGB at 150 DPI), and then it reconstructs the document from said pixels. We did some measurements in https://github.com/freedomofpress/dangerzone/issues/526, and for typical A4 documents, each page should take about 6.22 MiB at 150 DPI. Let's see how this applies to your documents:

| Document | Pages | Expected size (MiB) | Final size (MiB) |
|---|---|---|---|
| freenas9.2.1_guide.pdf | 280 | 1,741.6 | 89 |
| MagPi 146 | 133 | 827.26 | 128 |
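
For reference, the 6.22 MiB per-page figure and the "Expected size" column can be reproduced with a few lines of arithmetic. This is only a sketch, not Dangerzone's code: it assumes an A4 page of 210 mm x 297 mm, 3 bytes per pixel (8-bit RGB), and no compression, and the function names are made up for illustration.

```python
MM_PER_INCH = 25.4
BYTES_PER_MIB = 2**20


def raw_page_bytes(width_mm=210.0, height_mm=297.0, dpi=150, bytes_per_pixel=3):
    """Uncompressed size of one rendered page, in bytes.

    Assumption: A4 page (210 mm x 297 mm) and 8-bit RGB (3 bytes per pixel).
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel


def expected_size_mib(pages, **page_kwargs):
    """Uncompressed size of a whole document, in MiB."""
    return pages * raw_page_bytes(**page_kwargs) / BYTES_PER_MIB


if __name__ == "__main__":
    print(f"per page: {raw_page_bytes() / BYTES_PER_MIB:.2f} MiB")
    # Page counts taken from the table above.
    for name, pages in [("freenas9.2.1_guide.pdf", 280), ("MagPi 146", 133)]:
        print(f"{name}: {expected_size_mib(pages):.1f} MiB uncompressed")
```

The totals come out slightly higher than the table's because the table multiplies by the rounded 6.22 MiB figure rather than the unrounded per-page size.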

And here's where the compression comes into play. The table above tells us the following:

  1. The final file size is much less than the expected one. Compression is doing a good job there!
  2. The amount of graphics in a page affects the compression efficiency. You can see that the final MagPi document takes much more space than the FreeNAS guide, even though the MagPi document has about half the pages! That's because the MagPi document has lots of pictures and graphics, whereas the FreeNAS guide is leaner.
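
To put numbers on the two points above, here is a small sketch that computes the compression ratio and the compressed size per page from the figures in the table. The values are copied from this thread as reported, and the helper name is hypothetical.

```python
# (pages, expected MiB, final MiB), as reported in the table in this thread.
DOCUMENTS = {
    "freenas9.2.1_guide.pdf": (280, 1741.6, 89.0),
    "MagPi 146": (133, 827.26, 128.0),
}


def compression_stats(pages, expected_mib, final_mib):
    """Return (compression ratio, compressed MiB per page) for one document."""
    return expected_mib / final_mib, final_mib / pages


if __name__ == "__main__":
    for name, (pages, expected, final) in DOCUMENTS.items():
        ratio, per_page = compression_stats(pages, expected, final)
        print(f"{name}: {ratio:.1f}x smaller than raw pixels, "
              f"{per_page:.2f} MiB per page")
```

With these figures, the FreeNAS guide compresses by roughly 19.6x (about 0.32 MiB per page), while the graphics-heavy MagPi issue only manages about 6.5x (about 0.96 MiB per page).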

All in all, I think that Dangerzone can't do much better here, given the constraint that it has to convert pages to pixels. If your archiving method is doing something similar though, and you get better results, we'd like to know more.

In the meantime, I'll close this issue, but feel free to drop a comment.