Closed naglis closed 2 months ago
Quick FYI, I'm really stoked to check out the code here! As you may have noticed, we're currently prepping for a hotfix release (0.6.1), so those PRs have higher priority for now. But rest assured I'll review this PR soon :slightly_smiling_face:
Hey @naglis, I went ahead and refactored your branch a bit, with the following changes:
- `uncommon_filename` fixture, which is basically `uncommon_text`, with the exception of the invalid Unicode character that doesn't work in macOS filenames. I use this fixture instead of your string with Unicode surrogate escapes, so that we can have cross-platform tests for files with uncommon Unicode characters in them.
- `click.format_filename()`.

(Also, sorry for force-pushing in your branch. For reference, the original branch is tracked in `fix-printing-filename-naglis` in this repo)
If you are ok with the changes, I can resolve the conversations and merge this PR in time for 0.6.1. Else, we can discuss more and merge it afterwards. No pressure in any case :slightly_smiling_face:
Fine with me, the changes look good. Thanks for collaborating!
On Unix systems, a filename can be a sequence of bytes that is not valid UTF-8. Python uses surrogate escapes [1] to decode such filenames to Unicode: bytes that cannot be decoded are replaced by a surrogate, and upon encoding the surrogate is converted back to the original byte.
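To illustrate the round trip described above, here is a minimal sketch using only the standard library. The byte string and the helper names (`decode_filename`, `encode_filename`, `printable_filename`) are made up for the example and are not taken from this PR; UTF-8 is assumed as the filesystem encoding.

```python
# Hypothetical filename bytes: 0xFF can never appear in valid UTF-8.
RAW_NAME = b"report-\xff.pdf"


def decode_filename(raw: bytes) -> str:
    """Decode bytes the way Python does for undecodable filenames."""
    # Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_filename(name: str) -> bytes:
    """Reverse the decoding: surrogates turn back into the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


def printable_filename(name: str) -> str:
    """Make the name safe to print by swapping surrogates for '?'."""
    return name.encode("utf-8", errors="replace").decode("utf-8")


name = decode_filename(RAW_NAME)
print(repr(name))                # the 0xFF byte shows up as '\udcff'
print(encode_filename(name))     # original bytes are recovered
print(printable_filename(name))  # surrogate replaced, safe for stdout
```

Printing `name` directly to a strict UTF-8 `stdout` would raise `UnicodeEncodeError`, which is why the name has to be sanitized in some way before it is shown to the user.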
From `click` docs [2]:

~To fix that, we use `click.format_filename` [2] before printing the filenames to `stdout` so that surrogate escapes are replaced by �.~

Update: it was decided (see comment https://github.com/freedomofpress/dangerzone/pull/769#discussion_r1557187002) to instead use `replace_control_chars()` and also update its implementation to use Unicode General Category values to decide which characters to replace.

Fixes #768
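For readers unfamiliar with General Category values, the idea in the update above could look roughly like the sketch below. This is not the actual `replace_control_chars()` implementation from this repo: the function name `replace_control_chars_sketch`, the choice to replace every category starting with `C` (Cc, Cf, Cs, Co, Cn), and the `�` replacement are all assumptions made for illustration.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, assumed for this sketch


def replace_control_chars_sketch(untrusted: str) -> str:
    """Replace characters whose General Category is in the 'C' (Other) group.

    Assumption: the 'C' group (control, format, surrogate, private use,
    unassigned) is what should be hidden from the terminal. The real
    function in the repo may use a different set of categories.
    """
    sanitized = []
    for ch in untrusted:
        if unicodedata.category(ch).startswith("C"):
            sanitized.append(REPLACEMENT)
        else:
            sanitized.append(ch)
    return "".join(sanitized)


# An ANSI escape sequence and a surrogate-escaped byte are both neutralized,
# while ordinary non-ASCII letters are left untouched.
print(replace_control_chars_sketch("caf\u00e9-\x1b[31m-\udcff.pdf"))
```

Because lone surrogates fall under category `Cs`, an approach like this covers surrogate-escaped filenames as well as terminal control sequences, without a separate `click.format_filename()` step.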