Open aenygma opened 5 years ago
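When Python does a readdir (or equivalent), it receives the filenames as binary data and then decodes them as UTF-8 into text, which serves human readers rather than the machine. Bytes that are not valid UTF-8 end up as lone surrogates such as '\udcfc'.
Therefore, the fix should keep a reversible translation for machine use, and fall back to an irreversible one for human readability when the former is not feasible.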
The fix here seems to be:
That way, hashdex addresses each filename by its binary representation for all purposes, except when it has to be printed for the user's convenience.
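A minimal sketch of that split, assuming UTF-8 with the 'surrogateescape' handler as in the example below. The helper names `to_bytes` and `display_name` are hypothetical and are not part of hashdex:

```python
def to_bytes(name):
    """Reversible form: recover the exact on-disk bytes of a filename.

    Accepts either bytes (returned unchanged) or a str that may contain
    lone surrogates produced by the 'surrogateescape' error handler.
    """
    if isinstance(name, bytes):
        return name
    return name.encode("utf-8", "surrogateescape")


def display_name(raw):
    """Irreversible form: printable text, for the user's eyes only.

    Undecodable bytes become U+FFFD, so the result must never be
    passed back to the filesystem.
    """
    return raw.decode("utf-8", "replace")


if __name__ == "__main__":
    fname = "Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3"
    raw = to_bytes(fname)      # safe for open()/os.path.exists()
    print(display_name(raw))   # safe for print()
```

Only the bytes form is used as a key or passed to filesystem calls; the display form is produced on demand at the point of printing.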
Example:
Printing the name directly fails:
>>> fname
'Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3'
>>> os.path.exists(fname)
True
>>> print(fname)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 33: surrogates not allowed
Encoding the name with 'surrogateescape' preserves filesystem functionality (open/read/close syscalls):
>>> print(fname.encode('utf-8', 'surrogateescape'))
b'Calexico - Feast of Wire - 13 - G\xfcero Canelo.mp3'
>>> os.path.exists(fname.encode('utf-8', 'surrogateescape'))
True
>>> os.path.exists(fname.encode('utf-8', 'surrogateescape').decode('utf-8', 'surrogateescape'))
True
When printing is needed, the bytes can be decoded with the troublesome characters replaced:
>>> print(fname.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace'))
Calexico - Feast of Wire - 13 - G�ero Canelo.mp3
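The same idea can be applied at the point where names are read from disk. This is only a sketch under the assumption that names are collected with `os.scandir`; the function `list_names` is hypothetical and does not reflect how hashdex actually walks directories:

```python
import os
import tempfile


def list_names(directory):
    """Return sorted (raw_bytes_name, printable_name) pairs for a directory."""
    pairs = []
    # A bytes path makes os.scandir yield bytes names, exactly as stored.
    with os.scandir(os.fsencode(directory)) as entries:
        for entry in entries:
            raw = entry.name  # use this for open()/os.path.exists()
            printable = raw.decode("utf-8", "replace")  # use this for print()
            pairs.append((raw, printable))
    return sorted(pairs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "plain.mp3"), "w").close()
        for raw, printable in list_names(tmp):
            print(printable, os.path.exists(os.path.join(os.fsencode(tmp), raw)))
```

Asking for bytes up front avoids the decode step entirely, so no surrogates are ever created and nothing needs to be re-encoded before a filesystem call.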
Description
This is a class of bugs covering the Unicode-related problems that arise when the encoding of filenames is not specified.
What I Did
A file named "Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3" exists. hashdex chokes on