dharple / detox

Tames problematic filenames
BSD 3-Clause "New" or "Revised" License

Is `utf_8-only` doing more (e.g. converting brackets) than needed? #86

delphym closed this issue 2 years ago

delphym commented 3 years ago

Hello there,

I was wondering what I am doing wrong. I was hoping that running `detox -n -v -s utf_8-only /Volumes/sda2/Videos/Films/` would use the sequence from /usr/local/etc/detoxrc, which is defined as:

# transliterates UTF-8 to ASCII
sequence "utf_8-only" {
   utf_8;
};

I am getting the following: /Volumes/sda2/Videos/Films//The_Bourne_Supremacy_(Bournův_mýtus)_2004 -> /Volumes/sda2/Videos/Films//The_Bourne_Supremacy__Bournuv_mytus__2004

Based on the info in HACKING-v1.md, I would expect the following conversion instead: /Volumes/sda2/Videos/Films//The_Bourne_Supremacy_(Bournův_mýtus)_2004 -> /Volumes/sda2/Videos/Films//The_Bourne_Supremacy_(Bournuv_mytus)_2004

Also, FYI, I modified /usr/local/share/detox/safe.tbl. Here are the changes:

ζ diff /usr/local/share/detox/safe.tbl /usr/local/share/detox/safe.tbl.sample
95,99d94
< 0x28      (
< 0x29      )
< 0x5b      [
< 0x5d      ]
< 
128,131c123,126
< #0x28     -   # (
< #0x29     -   # )
< #0x5b     -   # [
< #0x5d     -   # ]
---
> 0x28      -   # (
> 0x29      -   # )
> 0x5b      -   # [
> 0x5d      -   # ]

So I would still expect the brackets not to be converted to - when I run detox with the "full" utf_8 (or default) sequence: `detox -n -v -s utf_8 /Volumes/sda2/Videos/Films/`. But the opposite is true :-( /Volumes/sda2/Videos/Films//The_Bourne_Supremacy_(Bournův_mýtus)_2004 -> /Volumes/sda2/Videos/Films//The_Bourne_Supremacy_Bournuv_mytus_2004

For the full picture, I'm attaching:

Note: I'm using macOS Mojave with the latest stable detox, v1.4.5, installed via Homebrew.

dharple commented 2 years ago

Add the lines you added to safe.tbl to unicode.tbl, and you should see the desired result.

0x28        (
0x29        )
0x5b        [
0x5d        ]

The utf_8 filter loads its translation table from unicode.tbl. The safe filter uses safe.tbl.

Currently, the brackets aren't specified in unicode.tbl, so the utf_8 filter converts them to the table's default character, which is set to _.

There is no way to override this behavior in detox 1.x, but it's coming in detox 2.
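
The lookup-with-default behavior described above can be illustrated with a small Python model. This is only a sketch, not detox's actual implementation: the function names `build_table` and `translate` are hypothetical, and it assumes that plain ASCII letters, digits, `_`, `.` and `-` already map to themselves in the table.

```python
# Simplified model of a table-driven filter with a default fallback.
# NOT detox's real code; names and table contents are illustrative assumptions.
import string

DEFAULT = "_"  # character used when a code point has no entry in the table


def build_table(extra=None):
    """Build a toy translation table, optionally with extra entries."""
    # Assumption: these ASCII characters are passed through unchanged.
    table = {c: c for c in string.ascii_letters + string.digits + "_.-"}
    # Two sample transliterations: U+016F (u with ring) and U+00FD (y with acute).
    table.update({"\u016f": "u", "\u00fd": "y"})
    if extra:
        table.update(extra)
    return table


def translate(name, table, default=DEFAULT):
    """Replace each character via the table; unknown characters get the default."""
    return "".join(table.get(ch, default) for ch in name)


name = "The_Bourne_Supremacy_(Bourn\u016fv_m\u00fdtus)_2004"

stock = build_table()  # brackets missing, so they fall back to "_"
patched = build_table({"(": "(", ")": ")", "[": "[", "]": "]"})  # brackets listed

print(translate(name, stock))
print(translate(name, patched))
```

With the stock table, the brackets fall through to the default and the name comes out with doubled underscores, as in the original report. Once the brackets are listed as mapping to themselves, they are preserved.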