freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.39k stars 155 forks source link

PyMuPDF logging prints to stdout #700

Closed deeplow closed 4 months ago

deeplow commented 5 months ago

While converting a document on the page streaming PR I have just noticed the following:

json

Here's the line in question. It turns out that PyMuPDF's logging simply prints the error. Fortunately I haven't seen this in the pixels_to_PDF, but if it does happen it'll interfere with out page streaming since it assumes all stdout is pixel data.

deeplow commented 5 months ago

This is probably a release blocker since it will interfere with the GUI as well:

visually

apyrgio commented 5 months ago

There's something that does not make sense to me: how come you're hitting this issue, given that the offending function is in the fitz_new module, and not in the regular fitz module that we copy into the container?

https://github.com/freedomofpress/dangerzone/blob/93bf0af3480b0ce69f88ae609c68b828dfa6a0ff/Dockerfile#L64

Can you give a bit more info about the environment you are seeing this at?

apyrgio commented 5 months ago

Also, I've opened an issue in the PyMuPDF repo: https://github.com/pymupdf/PyMuPDF/issues/3135

deeplow commented 5 months ago

There's something that does not make sense to me: how come you're hitting this issue, given that the offending function is in the fitz_new module, and not in the regular fitz module that we copy into the container?

I was manually looking for the __init__.py so I probably find the wrong one. But now looking at the src_classic, __init__.py is much smaller. So I don't think it's that init they're talking about.

I am running this on Fedora 38 on Qubes. I'll try another environment to see if there are any differences.

deeplow commented 5 months ago

And thanks for filing the upstream issue!

apyrgio commented 5 months ago

Yeap, turns out that in 1.23.9, the fitz_new module became fitz, and fitz became fitz_old: (changelog, rationale). Our use case is limited, and since our CI tests pass, it's probably best to follow suit.

deeplow commented 5 months ago

Curious that this happened, especially in a point release. Here they have explained that:

The best way to protect against this happening is to try out the rebased implementation in stage 1 by using import fitz_new as fitz, and report any problems so they can be fixed.

But if there isn't even a major version change how are users supposed know to check that there is a migration in progress? When users notice things is when things break, which is what happened here...

Per discussion with @apyrgio we'll be trying to stick with fitz_old at least for this release due to this very issue.

deeplow commented 5 months ago

The following may be a working solution:

diff --git a/Dockerfile b/Dockerfile
index 4b794b2d..b7e53091 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -61,7 +61,8 @@ RUN apk --no-cache -U upgrade && \
     py3-magic \
     font-noto-cjk

-COPY --from=pymupdf-build /usr/lib/python3.11/site-packages/fitz/ /usr/lib/python3.11/site-packages/fitz
+COPY --from=pymupdf-build /usr/lib/python3.11/site-packages/fitz /usr/lib/python3.11/site-packages/fitz
+COPY --from=pymupdf-build /usr/lib/python3.11/site-packages/fitz_old /usr/lib/python3.11/site-packages/fitz_old
 COPY --from=tessdata-dl /usr/share/tessdata/ /usr/share/tessdata
 COPY --from=h2orestart-dl /libreoffice_ext/ /libreoffice_ext

diff --git a/dangerzone/conversion/doc_to_pixels.py b/dangerzone/conversion/doc_to_pixels.py
index 5f1abf3b..2bce0cf2 100644
--- a/dangerzone/conversion/doc_to_pixels.py
+++ b/dangerzone/conversion/doc_to_pixels.py
@@ -7,7 +7,10 @@ import shutil
 import sys
 from typing import Dict, List, Optional, TextIO

-import fitz
+try:
+    import fitz_old as fitz
+except:
+    import fitz
 import magic

 from . import errors
diff --git a/dangerzone/conversion/pixels_to_pdf.py b/dangerzone/conversion/pixels_to_pdf.py
index 0243858d..e134680f 100644
--- a/dangerzone/conversion/pixels_to_pdf.py
+++ b/dangerzone/conversion/pixels_to_pdf.py
@@ -25,7 +25,10 @@ class PixelsToPDF(DangerzoneConverter):
             tempdir = "/safezone"

         # XXX lazy loading of fitz module to avoid import issues on non-Qubes systems
-        import fitz
+        try:
+            import fitz_old as fitz
+        except:
+            import fitz

         num_pages = len(glob.glob(f"{tempdir}/pixels/page-*.rgb"))
         total_size = 0.0

Note: do need to keep both fitz_old and fitz in the container image because if we rename the fitz_old module as fitz in the container image, this line fails since fitz_old is underfined. Even with a symlink fitz_old -> fitz doesn't help here.

@apyrgio what do you think? This solution works both in containers as well as in Qubes and it only adds 22MB to the container image (when compressed).

apyrgio commented 5 months ago

There was another suggestion, of pinning PyMuPDF to 1.23.8 (now currently at 1.23.21), at least for this release. I looked into the PyMuPDF source, and I don't see any significant code changes in the src_classic/ directory of PyMuPDF since 1.23.8, so we should be good: