Open xiaotaw opened 5 years ago
Thanks a lot for this detailed analysis. We are aware that Wikipedia contains sentences in both Traditional and Simplified Chinese. The multilingual LASER embeddings should transparently support both, so we handled the Chinese Wikipedia as one single "language". What are the alignment scores in the above examples? As explained in the paper, not all the provided data should be considered perfectly aligned. Please consider the alignment score: in our experience, sentences with a score higher than 1.04 seem to be good mutual translations.
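For anyone who wants to apply that threshold themselves, here is a minimal sketch. It assumes each TSV row has three tab-separated fields (score, English sentence, Chinese sentence); please check that against the file you downloaded. The function name `filter_by_score` and the sample rows are made up for illustration and are not taken from WikiMatrix.

```python
def filter_by_score(tsv_lines, threshold=1.04):
    """Yield (score, src, tgt) for rows whose score is above the threshold.

    Assumed row layout: score <TAB> source sentence <TAB> target sentence.
    Rows that do not fit this layout are skipped rather than guessed at.
    """
    for line in tsv_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # malformed row
        try:
            score = float(parts[0])
        except ValueError:
            continue  # first field is not a number
        if score > threshold:
            yield score, parts[1], parts[2]


# Made-up rows, only to show the behaviour of the filter.
sample_rows = [
    "1.20\tHello world.\t你好，世界。\n",
    "1.01\tThe sky is blue.\t子曰：學而時習之。\n",
    "not a valid row\n",
]
kept = list(filter_by_score(sample_rows))
```

With a real file you would pass an open file handle instead of `sample_rows`, e.g. `open("WikiMatrix.en-zh.tsv", encoding="utf-8")`.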
Update scores. LASER does support Modern Chinese written in either Traditional or Simplified Chinese characters, and we can find many sentence pairs aligned correctly. However, almost all the Classical Chinese sentences (usually written in traditional form) do not match their English, as I list in the table. Classical Chinese is comparable to Old English (450-1100 AD), and differs from the modern language.
I was aware of the difference between "traditional" and "simplified" Chinese; both are supported by LASER and, consequently, they should be well aligned in WikiMatrix. Do you mean that there is a third variant of Chinese, named "Classical" Chinese, which is significantly different from the former two? LASER was trained on bitexts from the UN corpus and OpenSubtitles and was probably never exposed to Classical Chinese. If this is confirmed, we should probably remove the badly aligned sentences in traditional Chinese from WikiMatrix. Is there a tool to detect the different variants of Chinese (we used fastText LID)?
Yes, Modern Chinese is quite different from "Classical" Chinese. The change results from the New Culture Movement of the 1910s and 1920s.
I have not found an out-of-the-box tool to recognize Classical Chinese. However, the Wikipedia dump of the Classical Chinese corpus may help in two ways: ① check whether the sentence appears in the classical corpus; ② train a text classifier to distinguish Classical from Modern Chinese.
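Both ideas can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the helper names (`normalize`, `in_classical_corpus`, `CharNaiveBayes`) are hypothetical, the six training sentences are hand-picked examples rather than a real corpus, and a character-level Naive Bayes is just one simple choice of classifier. A usable detector would need to be trained on the actual Classical Chinese Wikipedia dump and evaluated properly.

```python
import math
from collections import Counter


def normalize(sentence):
    """Keep only CJK ideographs so punctuation and spacing do not matter."""
    return "".join(ch for ch in sentence if "\u4e00" <= ch <= "\u9fff")


def in_classical_corpus(sentence, classical_set):
    """Idea 1: exact lookup in a set of normalized classical sentences."""
    return normalize(sentence) in classical_set


class CharNaiveBayes:
    """Idea 2: character-unigram Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.char_counts = {}        # label -> Counter of characters
        self.doc_counts = Counter()  # label -> number of training sentences

    def fit(self, labelled_sentences):
        for text, label in labelled_sentences:
            self.char_counts.setdefault(label, Counter()).update(normalize(text))
            self.doc_counts[label] += 1
        return self

    def predict(self, text):
        vocab = set().union(*self.char_counts.values())
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.char_counts.items():
            denom = sum(counts.values()) + len(vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for ch in normalize(text):
                score += math.log((counts[ch] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for illustration only (classical lines are from the Analects).
classical = ["學而時習之，不亦說乎", "有朋自遠方來，不亦樂乎", "三人行，必有我師焉"]
modern = ["我今天去學校上課", "他們正在討論這個問題", "這本書的內容很有意思"]

classical_set = {normalize(s) for s in classical}
model = CharNaiveBayes().fit(
    [(s, "classical") for s in classical] + [(s, "modern") for s in modern]
)
```

For example, `in_classical_corpus("學而時習之，不亦說乎？", classical_set)` is true because punctuation is ignored, and `model.predict("人不知而不慍，不亦君子乎")` comes out as `"classical"` on this toy data. The lookup only catches sentences copied verbatim from the classical corpus, so the classifier is the part that could generalize to unseen sentences.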
To clarify, in case you are not familiar with Chinese (and similar languages). (This is slightly off-topic: no programming involved.)
Sino-Tibetan languages are based on single syllables (to simplify) and contain massive numbers of homophones (words that sound the same, since there are only so many ways to form a single syllable). Tones help reduce this somewhat, but nowhere near enough. When speaking, the context indicates which homophone is meant. In addition, two syllables are often used to reduce ambiguity (in speech, two syllables/words are joined to form a single word).
When writing, alphabetical systems would be useless: it is much more efficient to use a single character for each syllable, so that each homophone gets its own distinct character. Visual specificity removes the phonetic ambiguity. Many languages in the region have historically adopted Chinese characters. For example, Vietnamese used them for a long time. Korean and Japanese have also long used them, for unrelated reasons, even though they are not Sino-Tibetan languages and do not have the same homophone issue: Korean in recent times adopted its own phonetic system, while Japanese retains a limited set of Chinese characters (limited compared to Chinese) and uses phonetic systems for everything else.
Mandarin is the Sino-Tibetan language that was imposed by Chinese emperors as a lingua franca to rule their empire (think Latin in Europe, Russian in the Soviet Union), but locally many people still speak a separate language (e.g. Cantonese in Hong Kong/Macao/Guangzhou, Shanghainese, Hakka...). All use the same characters, but they are mutually unintelligible: different pronunciation, different grammars.
As a language, modern Mandarin has nothing to do with what Confucius would have spoken. However, the same characters have been used during the 2,500 years between him and us. This is the Classical Chinese referred to. But to add to the confusion: (1) the meaning of characters has somewhat shifted over time and (2) since characters are so explicit about their meaning, written Classical Chinese makes a point of removing any character that is remotely superfluous to understanding (that's the reference to ellipsis).
Cosmetically, same characters, but they reflect completely different languages. The same way the same characters are used to write English, French or German.
Finally, Simplified vs Traditional: this has nothing to do with language. The Chinese communist government thought that traditional characters were too complicated, were culturally associated with the old times and were a hindrance to fighting illiteracy. They decided to simplify the characters (in the 1950s and 1960s). This is purely cosmetic, and the simplified characters often simply reflect what people were already using when handwriting. The Mandarin spoken in Taiwan is written in traditional characters. The mainland uses simplified characters, but it is the same language.
Basically, Classical Chinese is not Mandarin.
@Emmanuel-R8 Excellent and professional!
Classical Chinese vs. Modern Chinese (including everyday Mandarin, Cantonese, etc.):
Similarity: Classical Chinese shares most of its characters with Modern Chinese
Difference: ① the same character can have different meanings in Classical and Modern Chinese, just as 'Gift' means 'present' in English but 'poison' in German ② different grammar
Hello, I found that Classical Chinese sentences do not match their English in WikiMatrix.en-zh.tsv, e.g.
Classical Chinese differs from Modern Chinese in vocabulary, grammar, and genre. Ellipsis in Classical Chinese also makes it quite hard to understand, even for people like me, who have studied it from primary school to college.
In my view, we should treat Classical Chinese as an independent language. It will improve the quality of the extracted en-zh bitexts.