avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Map ・ and ー #71

Closed emiham closed 1 year ago

emiham commented 2 years ago

\u30fb and \u30fc are currently both mapped to an empty string. Is there a reason for this?

If not, \u30fc (ー) should probably be mapped to -, and \u30fb (・) to maybe ., although I'm slightly less sure about the latter.

Ge0rg3 commented 1 year ago

I disagree on the latter; they are used for different things.

emiham commented 1 year ago

@Ge0rg3 What do you propose instead for the latter?

Ge0rg3 commented 1 year ago

・ is either:

As such, with three different options, maybe best to leave as is?

avian2 commented 1 year ago

I would prefer to go with *. This is the same replacement as unidecode currently uses for U+00B7 (·).

Hopefully that isn't too bad for the Japanese use case. On the other hand, Unidecode is already unsuitable for use with Japanese for other reasons.

emiham commented 1 year ago

Good points. I think * is a good replacement. I don't think this presents an issue for Japanese; in fact, personally I prefer it in Japanese over just a .

emiham commented 1 year ago

I went ahead and made a PR for this.

I'll also take this opportunity to ask: is there a reason why some characters like these are missing? Some characters I'm sure are just uncommon enough to never really become an issue, but these two are common enough that I'm sure I'm not the first person to run into this.

Speaking more practically: is it a good idea to just make a PR for any other missing characters one encounters, or is there perhaps a use case I'm not thinking of where they are better left untouched?

avian2 commented 1 year ago

Thanks for the pull request. I've merged it.

As far as I remember, this thread is the only time these specific characters came up in unidecode.

Technically, there are two different cases in Unidecode: an empty replacement (empty string - Unicode codepoint intentionally replaced with an empty ASCII string) and an unknown replacement (None - no one has looked into this code point yet and specified a replacement).

By default (errors="ignore"), unknown replacements are simply replaced with an empty string, for backwards compatibility.

Some characters are intentionally replaced with an empty string. However, the initial source for Unidecode's tables (the Perl library of the same name) wasn't very strict about distinguishing between unknown and empty replacements. As a result, many characters in the tables have an empty string as a replacement when in fact they should be set to None.
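To make the empty-versus-unknown distinction concrete, here is a minimal toy model. It is not Unidecode's actual code or tables: TABLE, transliterate, and the entries for U+200B and U+E000 are hypothetical and chosen only for illustration. The mode names follow the values Unidecode accepts for its errors argument ("ignore", "strict", "replace", "preserve").

```python
# Toy model only: this table and function are hypothetical, not Unidecode's code.
TABLE = {
    "\u30fc": "-",   # ー, the replacement proposed in this issue
    "\u30fb": "*",   # ・, the replacement agreed on above
    "\u200b": "",    # empty replacement: deliberately dropped (illustrative choice)
    "\ue000": None,  # unknown replacement: nobody has specified one (illustrative)
}


def transliterate(text, errors="ignore", replace_str="?"):
    """Map non-ASCII characters through TABLE, handling unknowns per `errors`."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII passes through unchanged
            continue
        repl = TABLE.get(ch)  # characters missing from the table count as unknown
        if repl is not None:
            out.append(repl)  # includes "", the intentional empty replacement
        elif errors == "strict":
            raise ValueError(f"no replacement for U+{ord(ch):04X}")
        elif errors == "replace":
            out.append(replace_str)
        elif errors == "preserve":
            out.append(ch)
        # errors == "ignore": the unknown character silently becomes ""
    return "".join(out)


print(transliterate("a\u200bb\ue000c"))                    # abc
print(transliterate("a\u200bb\ue000c", errors="replace"))  # ab?c
```

With the default mode both characters vanish, so the two cases look identical in the output. Only the non-default modes reveal which code points are actually unknown, which is why a table entry of "" where None was intended hides the gap.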

emiham commented 1 year ago

Thank you, that's good to know. I'll open another PR if I come across anything else that seems to be empty when it perhaps shouldn't be.
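For anyone wanting to look for such characters in their own data, a rough sketch follows. The helper name find_dropped is hypothetical, and the transliteration function is passed in as a parameter so the snippet does not depend on any particular Unidecode version; the stub used here merely stands in for it.

```python
def find_dropped(text, translit):
    """Return the non-ASCII characters in `text` that `translit` turns into ''."""
    return sorted({ch for ch in text if ord(ch) > 127 and translit(ch) == ""})


def stub(ch):
    # Stand-in for a real transliterator: knows "é", drops everything else.
    return {"\u00e9": "e"}.get(ch, "")


dropped = find_dropped("caf\u00e9 \u30fb \u30fc", stub)
print([f"U+{ord(ch):04X}" for ch in dropped])  # ['U+30FB', 'U+30FC']
```

In practice one would pass unidecode.unidecode as translit. Note that with the default errors="ignore" this cannot tell an intentional empty replacement from an unknown one; it only lists candidates worth a closer look.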