Closed deeplow closed 8 months ago
Thanks for the review. This can only be merged after #627, so we'll have to wait.
@apyrgio I sneaked https://github.com/freedomofpress/dangerzone/commit/da684dd76840ae2b44ba9b5c2571a947a346cd14 into this PR since it is more closely related to this PR than to the page streaming PR (it was a leftover from the PyMuPDF integration). Plus, it's small enough that I don't think it deserves its own PR.
Verbally @apyrgio confirmed that this was fine to include. So I'll merge this now.
Interesting... GitHub still thinks this is a massive PR even though, with #627 merged, this branch is just 2 commits ahead of the tip of main.
PyMuPDF versions 1.22.5 and later accept the tesseract data path as an argument to `pixmap.pdfocr_tobytes()` [1], but lower versions require setting the TESSDATA_PREFIX environment variable instead [2].
Because on Qubes the pixels-to-PDF conversion happens on the host, and Qubes ships a lower PyMuPDF package version, we need to pass the path via the environment variable instead.
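To illustrate the version split described above, here is a minimal sketch of how a caller could decide between the keyword argument and the environment variable. The helper names (`parse_version`, `ocr_call_plan`) and the example path are hypothetical and not part of dangerzone; the `tessdata` keyword name is assumed from the PyMuPDF documentation. This PR itself does not branch on the version and simply sets the environment variable.

```python
import re

# First PyMuPDF release that accepts the tesseract data path as an argument,
# according to the description above.
TESSDATA_ARG_MIN_VERSION = (1, 22, 5)


def parse_version(version):
    """Turn a version string such as '1.22.5' into a tuple like (1, 22, 5)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # tolerate suffixes such as '0rc1'
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def ocr_call_plan(pymupdf_version, tessdata_dir):
    """Return (kwargs, env_updates) for a pdfocr_tobytes() call.

    kwargs: extra keyword arguments to pass to pixmap.pdfocr_tobytes().
    env_updates: environment variables to set before PyMuPDF is imported.
    """
    if parse_version(pymupdf_version) >= TESSDATA_ARG_MIN_VERSION:
        # Newer versions: pass the path directly (keyword name is an assumption).
        return {"tessdata": tessdata_dir}, {}
    # Older versions: the path can only be supplied via the environment.
    return {}, {"TESSDATA_PREFIX": tessdata_dir}
```

With an older version such as `"1.21.1"`, the plan contains no keyword arguments and a single `TESSDATA_PREFIX` entry, which matches the Qubes case described above.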
NOTE: the TESSDATA_PREFIX env. variable is set in dangerzone-cli rather than closer to the calling method in `doc_to_pixels.py`, since PyMuPDF reads this variable as soon as the fitz module is imported [3].
Fixes #682
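The ordering constraint in the note can be sketched as follows. This is only an illustration under stated assumptions: `configure_tessdata` and its parameters are hypothetical names, and the guard against an already-imported `fitz` is an addition for clarity, not something this PR implements.

```python
import os
import sys


def configure_tessdata(tessdata_dir, env=None, loaded_modules=None):
    """Set TESSDATA_PREFIX, refusing to do so once fitz has been imported.

    env and loaded_modules default to os.environ and sys.modules; they are
    parameters only so the behaviour can be exercised in isolation.
    """
    env = os.environ if env is None else env
    loaded_modules = sys.modules if loaded_modules is None else loaded_modules
    if "fitz" in loaded_modules:
        # Too late: per the note above, the variable is read at import time.
        raise RuntimeError("TESSDATA_PREFIX must be set before importing fitz")
    env["TESSDATA_PREFIX"] = tessdata_dir


# Intended order in an entry-point script (hypothetical path):
#   configure_tessdata("/usr/share/tessdata")
#   import fitz  # only imported after the variable is in place
```

Placing the call at the very top of the entry point keeps any later module that imports `fitz` from observing an unset variable.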