Lakshmipathi / dduper

Fast block-level out-of-band BTRFS deduplication tool.
GNU General Public License v2.0

Throws 'UnicodeEncodeError' on strange filename #15

Closed plattrap closed 4 years ago

plattrap commented 4 years ago

I backed up an old Windows disk onto a BTRFS-backed network share. Now dduper throws an exception on one of the file names.

ls gives the filename as: 'Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml'

Traceback (most recent call last):
  File "/usr/sbin/dduper", line 535, in <module>
    main(results)
  File "/usr/sbin/dduper", line 426, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/sbin/dduper", line 409, in dedupe_dir
    if validate_file(fn) is True:
  File "/usr/sbin/dduper", line 399, in validate_file
    file size < 4kb ")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 146: surrogates not allowed

Using the Docker image:

sudo docker run -it --device /dev/sdc -v /media/backup/:/mnt laks/dduper dduper --device /dev/sda1 --dir /mnt --analyze --recurse

Lakshmipathi commented 4 years ago

Thanks for the report. This seems to be a bug while traversing directory content with UTF-8 file names. Let me check.

Lakshmipathi commented 4 years ago

Are you using the latest Docker image? Try (docker pull laks/dduper). For me that file name seems to work:


Skipped /mnt/Finland.J$344rvenp$344344-Elisa.xml not unique regular files or             file size < 4kb 
Perfect match :  /mnt/a1 /mnt/Показатели      
Perfect match :  /mnt/a1 /mnt/Показат                 
Perfect match :  /mnt/a1 /mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml
Skipped /mnt/Finland.J$344rvenp$344344-Elisa.xml not unique regular files or             file size < 4kb 
Perfect match :  /mnt/a1 /mnt/Показатели      
Perfect match :  /mnt/a1 /mnt/Показат                 
Perfect match :  /mnt/a1 /mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml
Skipped /mnt/Finland.J$344rvenp$344344-Elisa.xml not unique regular files or             file size < 4kb 
Perfect match :  /mnt/a1 /mnt/Показатели      
Perfect match :  /mnt/a1 /mnt/Показат                 
Perfect match :  /mnt/a1 /mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml
Skipped /mnt/Finland.J$344rvenp$344344-Elisa.xml not unique regular files or             file size < 4kb 
Perfect match :  /mnt/a1 /mnt/Показатели      
Perfect match :  /mnt/a1 /mnt/Показат                 
Perfect match :  /mnt/a1 /mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml
Skipped /mnt/Finland.J$344rvenp$344344-Elisa.xml not unique regular files or             file size < 4kb 
Perfect match :  /mnt/a1 /mnt/Показатели      
Perfect match :  /mnt/a1 /mnt/Показат                 
Perfect match :  /mnt/a1 /mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml
Skipped /mnt/Finland.J$344rvenp$344344-Elisa.xml not unique regular files or             file size < 4kb 
Perfect match :  /mnt/a1 /mnt/Показатели      
Perfect match :  /mnt/a1 /mnt/Показат                 
Perfect match :  /mnt/a1 /mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml
+----------------+-------------------------------------------------------------+---------------+
| Chunk Size(KB) |                            Files                            | Duplicate(KB) |
+----------------+-------------------------------------------------------------+---------------+
|      256       |                       /mnt/a1:/mnt/a2                       |     51200     |
|      256       |                   /mnt/a1:/mnt/Показатели                   |     51200     |
|      256       |                     /mnt/a1:/mnt/Показат                    |     51200     |
|      256       | /mnt/a1:/mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml |     51200     |
+----------------+-------------------------------------------------------------+---------------+
dduper:204800KB of duplicate data found with chunk size:256KB 

+----------------+-------------------------------------------------------------+---------------+
| Chunk Size(KB) |                            Files                            | Duplicate(KB) |
+----------------+-------------------------------------------------------------+---------------+
|      512       |                       /mnt/a1:/mnt/a2                       |     51200     |
|      512       |                   /mnt/a1:/mnt/Показатели                   |     51200     |
|      512       |                     /mnt/a1:/mnt/Показат                    |     51200     |
|      512       | /mnt/a1:/mnt/Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml |     51200     |
+----------------+-------------------------------------------------------------+---------------+
dduper:204800KB of duplicate data found with chunk size:512KB 

plattrap commented 4 years ago

Some more detail on the file name: I dumped the file system representation of it and of another copy with a better encoding.

The problem seems to be the bad file name. There the ä is encoded as a single e4 byte, while in the good one it is encoded as c3 a4. So Python 3 tries to decode the sequence e4 72 as "utf-8" rather than as the two "iso-8859" characters är.

The solution probably is to treat file and directory names as byte strings, and do some extra checks before displaying them?
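
As a rough illustration of the byte-string idea (not dduper's actual code), here is a minimal sketch. The helper names iter_files_bytes and display_name are hypothetical. It assumes that walking the tree with a bytes path is acceptable and that names only need decoding at display time:

```python
import os
from typing import Iterator


def iter_files_bytes(root: str) -> Iterator[bytes]:
    """Yield file paths under root as raw bytes, exactly as stored on disk."""
    # Passing bytes to os.walk makes it return bytes, so no name is decoded.
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(root)):
        for name in filenames:
            yield os.path.join(dirpath, name)


def display_name(path: bytes) -> str:
    """Return a printable form of path; undecodable bytes become \\xNN escapes."""
    return path.decode("utf-8", errors="backslashreplace")


# The badly encoded name from this report, as raw bytes.
bad = b"Finland.J\xe4rvenp\xe4\xe4-Elisa.xml"
print(display_name(bad))
```

The bytes paths can be passed unchanged to open() or os.stat(), so only the display step has to deal with the encoding.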

Finland.Järvenpää-Elisa.xml
46 69 6e 6c 61 6e 64 2e 4a c3 a4 72 76 65 6e 70 c3 a4 c3 a4 2d 45 6c 69 73 61 2e 78 6d 6c

Finland.J�rvenp��-Elisa.xml
46 69 6e 6c 61 6e 64 2e 4a e4 72 76 65 6e 70 e4 e4 2d 45 6c 69 73 61 2e 78 6d 6c
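
To make the difference between the two dumps concrete, the sketch below decodes both byte sequences in Python 3. It relies on the fact that, on POSIX systems, Python decodes file names with the surrogateescape error handler, which is where the '\udce4' in the traceback comes from. The variable names are only for illustration:

```python
good = bytes.fromhex(
    "46 69 6e 6c 61 6e 64 2e 4a c3 a4 72 76 65 6e 70 c3 a4 c3 a4 "
    "2d 45 6c 69 73 61 2e 78 6d 6c"
)
bad = bytes.fromhex(
    "46 69 6e 6c 61 6e 64 2e 4a e4 72 76 65 6e 70 e4 e4 "
    "2d 45 6c 69 73 61 2e 78 6d 6c"
)

# The good name is valid UTF-8; the bad one only decodes as Latin-1.
# Both decode to the same text.
print(good.decode("utf-8") == bad.decode("latin-1"))

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
escaped = bad.decode("utf-8", errors="surrogateescape")
print(ascii(escaped))

# A lone surrogate cannot be encoded as strict UTF-8, which is what
# print() runs into when stdout is a UTF-8 stream.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)

# Encoding with the same handler gives back the original bytes.
print(escaped.encode("utf-8", errors="surrogateescape") == bad)
```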

Zip of the two files attached: F_test.zip

Lakshmipathi commented 4 years ago

Thanks for the details and zip file. They helped a lot during testing. Please pull the latest Docker image; it should work now. I replaced print(filename) with print(repr(filename)). With this change, the Docker image prints the following:

Perfect match :  '/mnt/f/Finland.Järvenpää-Elisa.xml' '/mnt/f/a'
+----------------+---------------------------------------------+---------------+
| Chunk Size(KB) |                    Files                    | Duplicate(KB) |
+----------------+---------------------------------------------+---------------+
|      256       | /mnt/f/Finland.Järvenpää-Elisa.xml:/mnt/f/a |     51200     |
+----------------+---------------------------------------------+---------------+

PS: this fix is currently only part of the Docker image; it still needs to be added to the repo's master branch.
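
For anyone wondering why repr() helps, here is a small sketch of the effect. It is not the dduper source. The filename string is a hypothetical stand-in for what directory listing returns for the badly encoded name, and the in-memory stream stands in for a UTF-8 stdout:

```python
import io

# Stand-in for a name returned by os.listdir(): the undecodable byte 0xe4
# has been turned into the lone surrogate U+DCE4.
filename = "Finland.J\udce4rvenp\udce4\udce4-Elisa.xml"

# A UTF-8 text stream, standing in for stdout inside the container.
out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

try:
    print(filename, file=out)
    out.flush()
except UnicodeEncodeError as exc:
    print("print(filename) failed:", exc.reason)

# repr() writes the surrogate as the ASCII escape \udce4, so the result
# can be encoded and printed on any stream.
safe = repr(filename)
print(safe)
```

The trade-off is that repr() also adds quotes around every name, as seen in the "Perfect match" line above.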

plattrap commented 4 years ago

Thanks, works with the latest docker image.

Lakshmipathi commented 4 years ago

Thanks @plattrap for the confirmation. I'll go ahead and mark this as resolved. Please report any further issues you encounter.