freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0

Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX #686

Closed deeplow closed 8 months ago

deeplow commented 8 months ago

NOTE: to be merged after https://github.com/freedomofpress/dangerzone/pull/627 since Qubes support for some reason was not working on main. The stream pages PR must have fixed something (can't recall exactly).

PyMuPDF versions 1.22.5 and later accept the tesseract data path as an argument to pixmap.pdfocr_tobytes() [1], but lower versions require setting the TESSDATA_PREFIX environment variable instead [2].

Because on Qubes the pixels-to-PDF conversion happens on the host, and Qubes ships a lower PyMuPDF package version, we need to pass the path via the environment variable instead.
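For illustration, here is a minimal sketch of the version-dependent choice described above. The helper names (`version_tuple`, `tessdata_strategy`) and the example paths are hypothetical and not taken from the Dangerzone code. The sketch only decides how the path would be handed over and does not call PyMuPDF itself.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn a version string such as "1.22.5" into (1, 22, 5).

    Non-numeric suffixes (e.g. "0rc1") are ignored, and missing parts are
    padded with zeros.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def tessdata_strategy(pymupdf_version: str, tessdata_dir: str, env: dict) -> dict:
    """Hypothetical helper: decide how to hand the tessdata path to PyMuPDF.

    Returns extra keyword arguments intended for pixmap.pdfocr_tobytes().
    For versions below 1.22.5 it sets TESSDATA_PREFIX in `env` instead and
    returns no extra arguments.
    """
    if version_tuple(pymupdf_version) >= (1, 22, 5):
        return {"tessdata": tessdata_dir}
    env["TESSDATA_PREFIX"] = tessdata_dir
    return {}


if __name__ == "__main__":
    env = {}
    # Newer PyMuPDF: pass the path as an argument, leave the environment alone.
    print(tessdata_strategy("1.23.8", "/usr/share/tessdata", env), env)
    # Older PyMuPDF (the Qubes case above): fall back to the env variable.
    print(tessdata_strategy("1.21.1", "/usr/share/tessdata", env), env)
```

In real code `env` would be `os.environ`; a plain dict is used here so the sketch has no side effects.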

NOTE: the TESSDATA_PREFIX environment variable is set in dangerzone-cli rather than closer to the calling method in doc_to_pixels.py, since PyMuPDF reads this variable as soon as the fitz module is imported [3].
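A rough sketch of the ordering constraint from the note above: the variable has to be in the environment before the import happens. The helper name `import_with_tessdata` and the example path are hypothetical. The module name is a parameter only so the sketch can be run without PyMuPDF installed.

```python
import importlib
import os


def import_with_tessdata(tessdata_dir: str, module_name: str = "fitz"):
    """Hypothetical helper: set TESSDATA_PREFIX, then import the module.

    The assignment must come first, because (per the note above) the
    variable is read as soon as the module is imported.
    """
    os.environ["TESSDATA_PREFIX"] = tessdata_dir
    return importlib.import_module(module_name)


if __name__ == "__main__":
    # "json" stands in for "fitz" here so the example runs anywhere.
    module = import_with_tessdata("/usr/share/tessdata", module_name="json")
    print(module.__name__, os.environ["TESSDATA_PREFIX"])
```

Setting the variable once at the entry point (as done in dangerzone-cli) avoids having to reason about which module imports fitz first.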

Fixes #682

deeplow commented 8 months ago

Thanks for the review. This can only be merged after #627, so we'll have to wait.

deeplow commented 8 months ago

@apyrgio I sneaked https://github.com/freedomofpress/dangerzone/commit/da684dd76840ae2b44ba9b5c2571a947a346cd14 into this PR since it is more topically related to this PR than to the page streaming PR (it was a leftover from the PyMuPDF integration). Plus, it's small enough that I don't think it deserves its own PR.

Verbally @apyrgio confirmed that this was fine to include. So I'll merge this now.

deeplow commented 8 months ago

Interesting... GitHub still thinks this is a massive PR, even though this branch is now just 2 commits ahead of the tip of main (after the merging of #627).