Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Display CJK characters with the same codepoint differently by language #222

Closed alanfgh closed 9 years ago

alanfgh commented 10 years ago

tommy_san writes:

An important issue related to fonts: Forms of kanji/hanzi are sometimes wrong!

This results from the fact that Unicode assigns the same codepoint to characters in different languages when they look alike. For example, even though the character 与 used in Japanese (the middle one in the lower row in the link below) is different in form from that used in other languages, they all share one codepoint, U+4E0E. http://www.scarfboy.com/coding/unicode-tool?s=U%2B4e0e

Currently, on my computer, the 与 in Simplified Chinese sentences is displayed in the Japanese form. I've also seen that on another computer, the 与 in Japanese sentences is not displayed in the correct form. How is it on your computer? http://tatoeba.org/jpn/sentences/search?query=%5E%E4%B8%8E&from=und&to=und

It seems that this problem can be solved by adding a "lang" attribute to the tag. You might then want to specify the fonts for each language.

jiru commented 9 years ago

Some more information about that problem.

  • Specifying the lang attribute is indeed the right way to make the browser render text with the right font.
  • Chromium ignores the lang attribute and uses the browser’s language instead. Firefox works.
  • Here is a very useful table to test your browser.
  • The problem extends to Japanese, simplified Chinese and traditional Chinese. Vietnamese and Korean are also concerned, but as far as I know we don’t use ideographic script for them. That being said, it’s good practice to specify a lang attribute for all languages.
  • We can’t just use ISO 639 three-letter codes for the lang attribute, because the attribute has its own set of language tags, and we need to specify the script for Chinese (simplified or traditional). In Firefox, lang="cmn" doesn’t work but lang="zh" does.
  • On Tatoeba, both sentences and the whole site need to have a lang attribute.
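
As a rough illustration of the mapping described in the list above, here is a minimal Python sketch that turns a three-letter language code plus an optional detected script into a value for the lang attribute. The table contents, the function name html_lang_tag and the "simplified"/"traditional" labels are assumptions made for this example; they are not Tatoeba's actual code. Unknown codes fall back to an empty string so the browser is left to decide.

```python
# Illustrative mapping only; not Tatoeba's actual table.
ISO639_3_TO_HTML_LANG = {
    "jpn": "ja",
    "cmn": "zh",
    "kor": "ko",
    "vie": "vi",
}

# Script subtags used in HTML language tags for Chinese.
SCRIPT_SUBTAGS = {"simplified": "Hans", "traditional": "Hant"}


def html_lang_tag(iso_code, script=None):
    """Return a lang attribute value, or "" when the code is not in the table."""
    base = ISO639_3_TO_HTML_LANG.get(iso_code)
    if base is None:
        return ""  # leave the language unspecified
    subtag = SCRIPT_SUBTAGS.get(script)
    if base == "zh" and subtag:
        return f"{base}-{subtag}"
    return base


print(html_lang_tag("cmn", "simplified"))
print(html_lang_tag("jpn"))
```

Keeping the script as a separate argument means the same function can serve sentences whose script is detected later.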

tommy-3 commented 9 years ago

Is it possible to automatically detect the language(s) of comments, Wall messages and private messages? It would be ideal if all parts were displayed properly when more than one language is used in one message, especially when LTR and RTL languages (http://prntscr.com/5xh5hx) or Japanese and Chinese are mixed.

jiru commented 9 years ago

@tommy-3 Making such a thing fully automatic would be quite a challenge.

It would indeed be possible to automatically detect the language of comments, Wall messages and private messages using the sentence language detector tatodetect. But that would not work with messages that are written in more than one language, which is a very common scenario on Tatoeba. And it wouldn’t work for languages that are not in Tatoeba, such as Vietnamese or Korean written in Chinese characters (although that is probably quite rare).

Language boundaries in a text that mixes RTL and LTR languages are already calculated automatically by browsers, and it’s not working perfectly, especially when the text includes punctuation like in your screenshot, because these characters are part of both languages’ writing systems (we already dealt with that in #357). I doubt we can beat what browsers are currently capable of.

Sorting Japanese, traditional Chinese and simplified Chinese out of a text that mixes them seems even more difficult especially without clear borders between each.

To limit the scope of this ticket, I suggest we start by setting the interface language for the whole page and the language of the sentences, and by wrapping other user-provided content in lang="" (undefined).
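
A minimal Python sketch of that scoping, assuming a hypothetical helper wrap_with_lang that emits a span element: sentences get their own language tag, while free-form user content gets an empty lang attribute. The helper name and the choice of span are illustrative, not taken from the Tatoeba code base.

```python
from html import escape


def wrap_with_lang(text, lang_tag, element="span"):
    """Wrap escaped text in an element carrying a lang attribute.

    An empty lang_tag yields lang="", which marks the language as unknown.
    """
    return f'<{element} lang="{escape(lang_tag, quote=True)}">{escape(text)}</{element}>'


# A sentence whose language is known.
print(wrap_with_lang("与", "ja"))
# A comment, Wall message or private message whose language is not known.
print(wrap_with_lang("a < b", ""))
```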

tommy-3 commented 9 years ago

OK. It might be better to assume that comments on sentences are written in the same language as the sentences themselves.

jiru commented 9 years ago

This is mostly fixed and the result is available on http://dev.tatoeba.org. The only remaining thing is the language selection drop down itself.

tommy-3 commented 9 years ago

http://dev.tatoeba.org/jpn/sentences/show/678956

他雇用了新秘书。
他雇用了新秘書。

Don't you distinguish between simplified and traditional Chinese?

jiru commented 9 years ago

Ah yes, we need to do that too.

tommy-3 commented 9 years ago

And it seems that «lang="yue"», «lang="wuu"» and «lang="lzh"» aren't working. It would be better to say that Cantonese and Shanghainese use simplified characters and that Literary Chinese uses traditional characters.

jiru commented 9 years ago

In what script are Cantonese and Shanghainese sentences actually written on Tatoeba, traditional or simplified? If not 100% of the sentences use the same script, we’d better leave it unspecified for the browsers.

tommy-3 commented 9 years ago

I think Shanghainese is almost always written with simplified characters. I found 14 sentences that use traditional characters, but it seems to me that they're simply mistakes.

Cantonese is written with both traditional and simplified characters. On Tatoeba, most sentences are written with traditional characters. I found 5 sentences that use simplified characters.

  • 708524 我除咗洋葱之外咩都食。 (nickyeow)
  • 1866973 今日唔使返学。 (fercheung)
  • 1878026 阿Tom同阿Mary倾咗成晚偈。 (fercheung)
  • 2864415 早知到,就唔好借钱卑佢。 (\N)
  • 3081247 以咖啡开始你的早晨。 (hsuan07)

It would be better to detect which one is used. You'll need to do it for Mandarin anyway, right?

jiru commented 9 years ago

It would be better to detect which one is used. You'll need to do it for Mandarin anyway, right?

No, I don’t need to do it for Mandarin because we’re already doing it. As explained in the FAQ, contributors may add sentences in both simplified and traditional, and an icon shows the detected script (that icon disappeared for a while but I just repaired it yesterday while I was at it). While this is all clear for Chinese, it’s not for Cantonese and Shanghainese, so I’d like to be 100% sure about what I’m doing.

@trang @allan-simon I’d like to know the current policy for sentences in Cantonese and Shanghainese. Are users free to use both traditional and simplified scripts, or do we assume a unique script? If both scripts are allowed, shouldn’t we put a simplified/traditional icon like on Chinese sentences?

trang commented 9 years ago

I don't think we've established any policy as to which script to use in Cantonese and Shanghainese. I actually don't know if these languages accept both scripts or not, but if they do, we should indeed use the script detection just like for Mandarin Chinese.

allan-simon commented 9 years ago

Both are correct for Cantonese: people from Guangdong province in mainland China write it in simplified Chinese, the official script there, while people from Hong Kong (who also speak Cantonese) are much more familiar with traditional characters. Stating a "preferred" script is likely to become a political troll topic.

Shanghainese is nowadays mostly written in simplified Chinese, but before Mandarin existed, books were written in Shanghainese in traditional characters.

I think Chinese is not the only language with this kind of issue, and the real solution is to take the bold move of considering that a sentence can have several "scripts" of equal "value" (i.e. in a relational database, putting the 'script' in a separate table), which would permit things like pre/post-reform spellings (as in French or German), Simplified/Traditional, etc.
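
One way to picture the separate-table idea is the Python sketch below, which uses an in-memory SQLite database. The table names, column names, the sentence id 1 and the "Hans"/"Hant" labels are all hypothetical and chosen only for illustration; this is not Tatoeba's schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database where one sentence has several script versions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentences (
            id   INTEGER PRIMARY KEY,
            lang TEXT NOT NULL
        );
        CREATE TABLE sentence_scripts (
            sentence_id INTEGER NOT NULL REFERENCES sentences(id),
            script      TEXT NOT NULL,
            text        TEXT NOT NULL,
            PRIMARY KEY (sentence_id, script)
        );
        """
    )
    conn.execute("INSERT INTO sentences (id, lang) VALUES (1, 'cmn')")
    conn.executemany(
        "INSERT INTO sentence_scripts (sentence_id, script, text) VALUES (?, ?, ?)",
        [(1, "Hans", "他雇用了新秘书。"), (1, "Hant", "他雇用了新秘書。")],
    )
    return conn


def versions_of(conn, sentence_id):
    """Return a {script: text} dict for every stored version of a sentence."""
    rows = conn.execute(
        "SELECT script, text FROM sentence_scripts WHERE sentence_id = ? ORDER BY script",
        (sentence_id,),
    )
    return dict(rows.fetchall())


conn = build_demo_db()
print(versions_of(conn, 1))
```

The composite primary key allows at most one text per script for a given sentence, so no version is privileged over another.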

jiru commented 9 years ago

Alright, thank you for your comments. I’ll add script detection for Cantonese and Shanghainese.

jiru commented 9 years ago

@tommy-3 @allan-simon I think I’m done with this issue. Could you check out the result on dev.tatoeba.org?

tommy-3 commented 9 years ago

There's still a small problem with sentences that consist only of characters used in both traditional and simplified Chinese (which doesn't necessarily mean that both versions look the same).

For example, the member egg0073 comes from Taiwan, so his/her sentences should be written with traditional characters. However, this sentence http://tatoeba.org/sentences/show/3718946 is detected as simplified, so the characters 花 and 吸 don't look the way s/he'd write them.

It's somewhat misleading when you browse sentences owned by a member, for example. http://prntscr.com/66dxlo

jiru commented 9 years ago

So I reported the problem to sysko, but I don’t think this can be easily fixed. We can’t possibly rely on the country of the user, and sinoparserd doesn’t tell us whether the script is clearly detected or not.

Actually, this may be fixed as a part of #77, because we’ll need to store the script of the sentence. So we should eventually be able to manually change the script of a sentence when the autodetection fails.

tommy-3 commented 9 years ago

It would be nice if sinoparserd would return "traditional", "simplified" or "could be both". When it could be both, it might be best to decide based on the other sentences owned by the same member.
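
The suggestion could be sketched in Python roughly as follows. This is not how sinoparserd works: the two tiny character sets, the function names and the majority-vote fallback are assumptions made for the example, and a real detector would need a full list of simplified-only and traditional-only characters.

```python
from collections import Counter

# Tiny illustrative sets; a real detector would need complete variant data.
SIMPLIFIED_ONLY = set("书钱学开")
TRADITIONAL_ONLY = set("書錢學開")

UNDECIDED = "could be both"


def classify_script(text):
    """Return "simplified", "traditional" or "could be both"."""
    has_simplified = any(ch in SIMPLIFIED_ONLY for ch in text)
    has_traditional = any(ch in TRADITIONAL_ONLY for ch in text)
    if has_simplified and not has_traditional:
        return "simplified"
    if has_traditional and not has_simplified:
        return "traditional"
    # No distinguishing characters, or conflicting ones: stay undecided.
    return UNDECIDED


def resolve_script(text, other_texts_by_owner):
    """Fall back to the owner's other sentences when the text alone is undecided."""
    verdict = classify_script(text)
    if verdict != UNDECIDED:
        return verdict
    votes = Counter(classify_script(t) for t in other_texts_by_owner)
    votes.pop(UNDECIDED, None)
    if not votes:
        return UNDECIDED
    (top, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return UNDECIDED  # tie between scripts
    return top


print(classify_script("他雇用了新秘書。"))
print(resolve_script("花", ["他雇用了新秘書。"]))
```

Returning an explicit undecided value, rather than defaulting to simplified, lets the caller choose between the owner-based guess and leaving the script unspecified.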