avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

U+1F6xx block: emoticons and dingbats #18

Open chungy opened 6 years ago

chungy commented 6 years ago

The emoticons might end up being especially contentious, with styles and opinions both varying wildly, and I've only tried to replicate ones that seem to be pretty simple.

For the emoji that aren't really representable in ASCII, I'm not sure what should be done. Maybe just the colon-sandwiched codes as seen in some messengers and GitHub? E.g., :cheese_wedge: or :wolf:

avian2 commented 6 years ago

Yes, emoticons are problematic. I would try to be consistent with the textual description provided by Unicode, e.g. have all emoticons whose description says "open mouth" use "D", etc. In your patch, for example, U+1F603 "Smiling face with open mouth" is :-), but U+1F604 "Smiling face with open mouth and smiling eyes" is :-D. I would put both as :-D.

Regarding other emojis, I agree colon codes seem to be the best solution. Problems I see are:

But honestly I don't see any other solution. They seem to be the de-facto way people put emojis into plain ASCII.
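To illustrate the consistency rule suggested above (pick the ASCII form from phrases in the Unicode description), here is a minimal sketch. It is not part of Unidecode or of the patch under discussion: `MOUTH_RULES` and `emoticon_for` are hypothetical names, and the rule table is an assumption made only for illustration. It relies on the standard library's `unicodedata.name()`.

```python
import unicodedata

# Hypothetical rule table: phrase in the Unicode name -> ASCII mouth character.
# The first matching phrase wins, so more specific phrases are listed first.
MOUTH_RULES = [
    ("OPEN MOUTH", "D"),
    ("SMILING", ")"),
]

def emoticon_for(char, default=None):
    """Return an ASCII emoticon chosen from the character's Unicode name."""
    name = unicodedata.name(char, "")
    for phrase, mouth in MOUTH_RULES:
        if phrase in name:
            return ":-" + mouth
    return default

# U+1F603 and U+1F604 both mention "OPEN MOUTH", so they get the same form.
for cp in (0x1F603, 0x1F604):
    print("U+%04X" % cp, unicodedata.name(chr(cp)), emoticon_for(chr(cp)))
```

With a table like this, two characters whose descriptions share a phrase cannot end up with different emoticons by accident, which is the inconsistency pointed out above.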

chungy commented 6 years ago

Sorry it's taken so long for me to follow up on your comments, but I appreciate them a lot. Setting some standards for the emoticons, with representations matched to the Unicode descriptions, seems to be a good idea.

We can handle the more graphic-oriented emojis similarly, making up colon syntaxes based on the Unicode name rather than on any specific application. My only problem is that the name of the character can be rather verbose and long.

As an example, 🖖 is represented in Keybase with :spock-hand:, but the Unicode name is “RAISED HAND WITH PART BETWEEN MIDDLE AND RING FINGERS”. I think that :raised-hand-with-part-between-middle-and-ring-fingers: is not quite desirable, but I don't know the best course here.
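A minimal sketch of what deriving a colon code straight from the Unicode name would look like, to make the verbosity concrete. `name_to_colon_code` is a hypothetical helper, not anything in Unidecode; the lowercase-and-hyphenate convention is an assumption that mirrors the example above.

```python
import unicodedata

def name_to_colon_code(char):
    """Build a colon code from the formal Unicode name of a single character."""
    name = unicodedata.name(char)  # raises ValueError if the character is unnamed
    return ":" + name.lower().replace(" ", "-") + ":"

code = name_to_colon_code("\U0001F596")
print(code, len(code))
```

Short names come out fine this way, but long ones such as the one above produce codes that are unwieldy compared with application-specific short codes.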

avian2 commented 6 years ago

I haven't seen actual Unicode names used in this way. I agree that using them for colon codes wouldn't be the best. The de-facto standard seems to be "short codes", like listed here:

https://www.webpagefx.com/tools/emoji-cheat-sheet

I don't think these are condoned by Unicode and I don't know where they originally came from. Some software library or WordPress perhaps? I seem to remember seeing a page that listed differences in these codes between different services, but I can't find it at the moment.

Stealthii commented 6 years ago

Unidecode already translates based on romanisation and pronunciation of some foreign characters, and I believe the intention of this library is to convey representation. After all, our output is ambiguous at best, as the actual truth lies within the original Unicode literal.

For this reason, I think the best effort approach is to convey the clearest meaning, and I would suggest the cheat-sheet that @avian2 linked, as the shortcodes described there are a de-facto implementation used by most chat clients.

There is nothing to say we can't change this in future, right?

mvasilkov commented 6 years ago

May I suggest not using smiley faces made of punctuation? IMO shortcodes like :open_mouth: are much better than :-O.

avian2 commented 6 years ago

On the other hand, :open_mouth: is language-specific. In my opinion punctuation smileys (where applicable) would actually be more universal in that respect.

The problem of a smiley possibly merging with an adjacent word can be solved by surrounding it with a leading and/or trailing space. Unidecode already does that for some symbols.
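A minimal sketch of the padding idea, assuming a hypothetical replacement table whose entries carry their own leading and trailing space. `PADDED` and `replace_emoticons` are made-up names for illustration; this does not show how Unidecode's own tables are laid out.

```python
# Hypothetical table: each replacement is padded with a space on both sides
# so the smiley cannot fuse with the neighbouring word.
PADDED = {
    "\U0001F603": " :-D ",
    "\U0001F604": " :-D ",
}

def replace_emoticons(text, table=PADDED):
    """Replace each character found in the table; leave everything else as-is."""
    return "".join(table.get(ch, ch) for ch in text)

print(replace_emoticons("great\U0001F603thanks"))
```

Without the padding the same input would come out as `great:-Dthanks`, where the smiley is hard to tell apart from the surrounding text. The trade-off is that padded entries can introduce doubled spaces when the input already has whitespace around the emoji.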

mvasilkov commented 6 years ago

Ah, I assumed it was targeting English the whole time, since it mentions the US keyboard layout in the readme, and also this: https://github.com/avian2/unidecode/blob/master/unidecode/x033.py

But fair enough I guess.

avian2 commented 6 years ago

I wouldn't like to target English any more than is implied by the fact that Unidecode transliterates into a character set with a US origin. The US keyboard layout is mentioned because that is the most common layout used to enter ASCII text. I used it as an illustration of the problem Unidecode tries to solve: imagine a person trying to enter non-English words into a computer that only accepts ASCII through an American keyboard.

I am not familiar with the U+33xx Unicode page you mention. Codepoint descriptions suggest these represent those specific English words (I'm guessing for use in Japanese text?)

To be honest, I see no perfect solution for emojis, and Unidecode is about compromise. I think English short-codes as discussed above would be a good start.