avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Soft hyphen #10

Closed xeor closed 7 years ago

xeor commented 7 years ago

The character u"\xad" (http://www.fileformat.info/info/unicode/char/00ad/index.htm) looks exactly the same as a normal hyphen (http://www.fileformat.info/info/unicode/char/2010/index.htm).

If I try to normalize it to ASCII, however, it ends up as an empty string:

unidecode.unidecode(u"\xad")
''

Other nasty characters, like the HORIZONTAL ELLIPSIS (http://www.fileformat.info/info/unicode/char/2026/index.htm), are normalized correctly.

unidecode.unidecode(u"\u2026")
'...'

Can the SOFT HYPHEN also be normalized into an -?

avian2 commented 7 years ago

Unicode classifies character U+00AD as "non-printable". It is not supposed to behave like a normal hyphen, but rather as an invisible control character that marks possible line breaks.

For a longer discussion, see https://en.wikipedia.org/wiki/Soft_hyphen and the Unicode interpretation of SOFT HYPHEN letter linked in its references.

In general I think Unidecode behavior is correct.
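
As an illustration of the classification mentioned above (an editorial sketch, not part of the original comment), Python's standard unicodedata module can report the general category of each character. The helper name describe is hypothetical. SOFT HYPHEN falls in the format category (Cf), while HYPHEN and the ASCII HYPHEN-MINUS are dash punctuation (Pd):

```python
import unicodedata


def describe(ch):
    """Return (Unicode name, general category) for a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)


# Compare the soft hyphen with the two "visible" hyphens.
for ch in ("\u00ad", "\u2010", "-"):
    name, category = describe(ch)
    print(f"U+{ord(ch):04X} {name}: {category}")

# U+00AD SOFT HYPHEN: Cf
# U+2010 HYPHEN: Pd
# U+002D HYPHEN-MINUS: Pd
```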

xeor commented 7 years ago

Thanks for the quick response!

I see your point. But isn't part of the goal of Unidecode to ASCII-fy text, even if it means losing some of its meaning ("Unidecode, lossy ASCII transliterations of Unicode text")? Personally, given that statement and this "soft hyphen" issue, I would expect Unidecode to break the Unicode rules and produce a hyphen.

If I wanted to fix a Unicode problem, or "follow the rules", I would use something like https://github.com/LuminosoInsight/python-ftfy.

Sorry if I'm missing the point :)

avian2 commented 7 years ago

Unidecode tries to preserve the meaning as much as possible, given its constraints (non-language-specific, character-by-character substitution). Also, one of the assumptions it makes is that the input string is valid Unicode according to the standard (in contrast to ftfy, which you linked).

The "lossy" in the title comes from the fact that in most cases you can't get the original Unicode string back from Unidecode's output.

xeor commented 7 years ago

ok, I see..

I am pushing this because I saw the SOFT HYPHEN misused in the wild, and it bit me. It was mixed in with some normal hyphens. I traced it back to Excel, but I have no clue how it ended up in that list.

Do you know of any other tools that break the rules, then? Or would a .replace() be my solution? Or are you reconsidering based on this comment?

avian2 commented 7 years ago

I suggest you do a .replace() on your strings to correct the hyphens before passing them to Unidecode.

If I change the Unidecode behavior, then somebody who uses the hyphens correctly will have the inverse problem. I think it's better if Unidecode follows the Unicode standard regarding this.
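
A minimal sketch of the suggested .replace() approach (editorial illustration, not from the original thread). The helper name replace_soft_hyphens and the sample string are hypothetical. The replacement is a parameter because the right choice depends on the data: "-" suits input where the soft hyphen was misused as a visible hyphen, while "" suits input where it correctly marks an invisible break opportunity.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def replace_soft_hyphens(text, replacement="-"):
    """Replace every SOFT HYPHEN in text with the given replacement string."""
    return text.replace(SOFT_HYPHEN, replacement)


# Hypothetical input mixing a soft hyphen with a normal hyphen.
raw = "part\u00adnumber-42"

print(replace_soft_hyphens(raw))      # part-number-42
print(replace_soft_hyphens(raw, ""))  # partnumber-42
```

The cleaned string can then be passed to unidecode.unidecode() as usual.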