aenygma / hashdex

A file indexer based on content hashes to quickly find duplicate files on your system.
https://hashdex.readthedocs.io
MIT License

Filenames with no encoding cause problems with database and cli printing #3

Open aenygma opened 5 years ago

aenygma commented 5 years ago

Description

This is a class of Unicode-related bugs that appear when the encoding of a filename is not specified.

What I Did

A file named 'Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3' exists. hashdex chokes on it in two places:

  1. inserting the name into sqlite raises an exception
  2. click's progress bar throws an exception when printing the name to the screen
aenygma commented 5 years ago

When Python does a readdir (or equivalent), it gets the filenames as binary data and then decodes them as UTF-8 into str for human-facing purposes. Bytes that are not valid UTF-8 survive as lone surrogates (here the byte \xfc becomes \udcfc).

Therefore, the fix should keep a reversible translation for machine use, and fall back to an irreversible one for human readability when the former is not feasible.

The fix here seems to be:

  1. when reading file names from the filesystem or the database, encode/decode them with 'surrogateescape'
  2. when printing them, use 'replace', since the printed filename is not expected to reflect the binary equivalent (in the case of surrogate chars)

That way hashdex addresses each filename by its binary representation for all purposes, except when it has to print the name for the user's convenience.
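
The two steps could be sketched roughly as below. This is only an illustration, not hashdex code: the helper names `to_storage`, `from_storage` and `for_display` are made up here, and UTF-8 is assumed to be the filesystem encoding.

```python
# Hypothetical helpers (names are not from hashdex); assumes a UTF-8 filesystem encoding.

def to_storage(name: str) -> bytes:
    """Reversible: turn a str from the filesystem back into the original bytes."""
    return name.encode("utf-8", "surrogateescape")


def from_storage(raw: bytes) -> str:
    """Reversible: turn stored bytes back into a str that os functions accept."""
    return raw.decode("utf-8", "surrogateescape")


def for_display(name: str) -> str:
    """Irreversible: undecodable bytes become U+FFFD so the result is printable."""
    return to_storage(name).decode("utf-8", "replace")


fname = "Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3"
raw = to_storage(fname)        # original bytes, including the stray \xfc
same = from_storage(raw)       # round-trips to the exact same str
shown = for_display(fname)     # safe to hand to print() or a progress bar
```

The standard library's `os.fsencode` and `os.fsdecode` perform the same reversible conversion using whatever the filesystem encoding actually is, so they may be a better fit than hard-coding 'utf-8'.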

aenygma commented 5 years ago

Example:

Straight up printing fails

>>> fname                                                                                                  
'Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3'                                                    
>>> os.path.exists(fname)                                                                                  
True                                                                                                          
>>> print(fname)                                                                                           
Traceback (most recent call last):                                                                         
  File "<stdin>", line 1, in <module>                                                                      
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 33: surrogates not allowed

Encoding the name with surrogateescape preserves filesystem functionality (open/read/close syscalls)

>>> print(fname.encode('utf-8', 'surrogateescape'))                                                        
b'Calexico - Feast of Wire - 13 - G\xfcero Canelo.mp3'                                                     
>>> os.path.exists(fname.encode('utf-8', 'surrogateescape'))                                               
True                                                                                                       
>>> os.path.exists(fname.encode('utf-8', 'surrogateescape').decode('utf-8', 'surrogateescape'))            
True                                                                                                       
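
The same bytes should also be storable in sqlite. A minimal sketch, assuming the name is kept in a BLOB column; the `files` table and its `name` column are hypothetical and not hashdex's real schema:

```python
import sqlite3

fname = "Calexico - Feast of Wire - 13 - G\udcfcero Canelo.mp3"

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a BLOB column holds the raw filename bytes.
conn.execute("CREATE TABLE files (name BLOB NOT NULL)")

# Store the reversible byte form instead of the str with lone surrogates.
raw = fname.encode("utf-8", "surrogateescape")
conn.execute("INSERT INTO files (name) VALUES (?)", (raw,))

# Reading back yields bytes; decode with surrogateescape to get the original str.
(stored,) = conn.execute("SELECT name FROM files").fetchone()
restored = stored.decode("utf-8", "surrogateescape")
conn.close()
```

Python's sqlite3 module binds bytes values as BLOBs and returns them as bytes, so no text encoding step is involved on the way in or out.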

When printing is needed, the bytes can be decoded with 'replace' to substitute the troublesome chars

>>> print(fname.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace'))                             
Calexico - Feast of Wire - 13 - G�ero Canelo.mp3