freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.36k stars 153 forks source link

Fix printing of filenames with surrogate escapes #769

Closed naglis closed 2 months ago

naglis commented 2 months ago

On Unix systems a filename can be a sequence of bytes that is not valid UTF-8. Python uses1 surrogate escapes to allow to decode such filenames to Unicode (bytes that cannot be decoded are replaced by a surrogate; upon encoding the surrogate is converted to the original byte).

From click docs2:

Invalid bytes or surrogate escapes will raise an error when written to a stream with errors="strict". This will typically happen with stdout when the locale is something like en_GB.UTF-8.

~To fix that, we use click.format_filename2 before printing the filenames to stdout so that surrogate escapes are replaced by �.~ Update: it was decided (see comment https://github.com/freedomofpress/dangerzone/pull/769#discussion_r1557187002) to instead use replace_control_chars() and also update its implementation to use Unicode General Category values to decide which characters to replace.

Fixes #768

apyrgio commented 2 months ago

Quick FYI, I'm really stoked to check out the code here! As you may have noticed, we're currently prepping up for a hotfix release (0.6.1), so ultimately those PRs have higher priority for now. But rest assured I'll review this PR soon :slightly_smiling_face:

apyrgio commented 2 months ago

Hey @naglis, I went ahead and refactored your branch a bit, with the following changes:

  1. I added a new uncommon_filename fixture, which is basically uncommon_text, with the exception of the invalid Unicode character that doesn't work in macOS filenames. I use this fixture instead of your string with Unicode surrogate escapes, so that we can have cross-platform tests for files with uncommon Unicode characters in them.
  2. I squashed some commits, so that the Git history does not contain click.format_filename().
  3. I retained your Git authorship where applicable.

(Also, sorry for force-pushing in your branch. For reference, the original branch is tracked infix-printing-filename-naglis in this repo)

If you are ok with the changes, I can resolve the conversations and merge this PR in time for 0.6.1. Else, we can discuss more and merge it afterwards. No pressure in any case :slightly_smiling_face:

naglis commented 2 months ago

If you are ok with the changes, I can resolve the conversations and merge this PR in time for 0.6.1. Else, we can discuss more and merge it afterwards. No pressure in any case 🙂

Fine with me, the changes look good. Thanks for collaborating!