Filenames with no encoding cause problems with database and cli printing

aenygma commented 5 years ago

Hashdex version: 0.6.0
Python version: Python 3.7.3
Operating System: FreeBSD 11.2-STABLE

Description

This issue covers a class of Unicode-related bugs that show up when the encoding of a filename is not specified.

What I Did

A file named 'Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3' exists. hashdex chokes on it in two places:

inserting the name into sqlite raises an exception
click's progressbar throws an exception when printing to the screen
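
The sqlite half of this can be reproduced without hashdex. The sketch below is only an illustration: the `files` table and its column names are hypothetical and are not hashdex's schema. It binds the surrogate-carrying name as text, which fails, and then stores the raw bytes as a BLOB instead.

```python
import sqlite3

# The file name from this report; '\udcfc' is a lone surrogate standing in for byte 0xFC.
fname = 'Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3'

conn = sqlite3.connect(':memory:')
# Hypothetical table, for illustration only (not hashdex's schema).
conn.execute('CREATE TABLE files (name_text TEXT, name_raw BLOB)')

text_insert_failed = False
try:
    # sqlite3 must encode str parameters as UTF-8, which rejects lone surrogates.
    conn.execute('INSERT INTO files (name_text) VALUES (?)', (fname,))
except UnicodeEncodeError:
    text_insert_failed = True

# Binding the original bytes avoids the encoding step entirely.
raw = fname.encode('utf-8', 'surrogateescape')
conn.execute('INSERT INTO files (name_raw) VALUES (?)', (raw,))
stored = conn.execute(
    'SELECT name_raw FROM files WHERE name_raw IS NOT NULL'
).fetchone()[0]

# Decoding with the same handler restores the str that os functions accept.
restored = stored.decode('utf-8', 'surrogateescape')
print(text_insert_failed, restored == fname)
```

Storing a BLOB keeps the exact bytes, at the cost of the column no longer being directly readable as text in sqlite tools.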


aenygma commented 5 years ago

When Python does a readdir (or equivalent), it gets the filenames as binary data and then decodes them as UTF-8 into str for human-facing purposes. Bytes that are not valid UTF-8 come through as lone surrogates such as \udcfc.
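
To make the origin of '\udcfc' concrete, here is a small sketch that decodes raw bytes the same way. It assumes a UTF-8 filesystem encoding and spells out the 'surrogateescape' handler explicitly; the shortened name `raw_name` is made up for the example.

```python
# Raw bytes as stored on disk: 0xFC is presumably a Latin-1 'ü', which is not valid UTF-8.
raw_name = b'G\xfcero Canelo.mp3'

# Decoding with 'surrogateescape' maps each undecodable byte 0xNN to the code point U+DCNN.
decoded = raw_name.decode('utf-8', 'surrogateescape')
escaped_char = decoded[1]

# The mapping is reversible: encoding with the same handler restores the bytes.
round_tripped = decoded.encode('utf-8', 'surrogateescape')

print(ascii(decoded))
print(hex(ord(escaped_char)))
print(round_tripped == raw_name)
```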

Therefore, the fix should use a reversible translation for machine purposes, and an irreversible one for human readability when the former is not feasible.

The fix here seems to be:

when reading file names from the filesystem or the database, encode/decode them with the 'surrogateescape' error handler
when printing them, use 'replace', since a printed filename is not expected to reflect the binary equivalent (in the case of surrogate chars)

That way hashdex addresses each filename by its binary representation for all purposes, except when it has to print the name for the user's convenience.
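
The two rules could be sketched as a pair of reversible helpers plus one display-only helper. The function names below are hypothetical and are not part of hashdex; UTF-8 is assumed as the filesystem encoding.

```python
def name_to_bytes(name: str) -> bytes:
    """Reversible: recover the exact bytes the filesystem reported."""
    return name.encode('utf-8', 'surrogateescape')


def name_from_bytes(raw: bytes) -> str:
    """Reversible: turn stored bytes back into the str that os functions accept."""
    return raw.decode('utf-8', 'surrogateescape')


def name_for_display(name: str) -> str:
    """Irreversible: swap undecodable bytes for U+FFFD so printing cannot fail."""
    return name_to_bytes(name).decode('utf-8', 'replace')


fname = 'Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3'

# Machine path: the round trip gives back the identical name.
assert name_from_bytes(name_to_bytes(fname)) == fname

# Human path: the result contains no surrogates, so it encodes cleanly.
shown = name_for_display(fname)
shown.encode('utf-8')
print(shown)
```

Only `name_for_display` loses information, so its result should go to the screen or a progress bar and never back into a filesystem or database call.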

aenygma commented 5 years ago

Example:

Straight up printing fails

>>> fname
'Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3'
>>> os.path.exists(fname)
True
>>> print(fname)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 33: surrogates not allowed

Encoding the name with 'surrogateescape' preserves filesystem functionality (open/read/close syscalls)

>>> print(fname.encode('utf-8', 'surrogateescape'))
b'Calexico - Feast of Wire - 13 - G\xfcero Canelo.mp3'
>>> os.path.exists(fname.encode('utf-8', 'surrogateescape'))
True
>>> os.path.exists(fname.encode('utf-8', 'surrogateescape').decode('utf-8', 'surrogateescape'))
True

When printing is needed, it can be decoded to replace troublesome chars

>>> print(fname.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace'))
Calexico - Feast of Wire - 13 - G�ero Canelo.mp3

aenygma / hashdex

Filenames with no encoding cause problems with database and cli printing #3
