facebookresearch / LASER

Language-Agnostic SEntence Representations

Consider Classical Chinese as an independent language #84

Open xiaotaw opened 5 years ago

xiaotaw commented 5 years ago

Hello, I found that Classical Chinese sentences do not match their English counterparts in WikiMatrix.en-zh.tsv, e.g.:

| Line no. | English | Chinese | Score | Chinese source | Translation in Mandarin | Translation in English |
| --- | --- | --- | --- | --- | --- | --- |
| 9 | He is sometimes seen having arguments with his lord. | 正月,朝見其主(王)龔。 | 1.20 | A historical record (后汉书, quoted by 风俗通义校注) | 一月份,他拜见了他的君主(王)龚。 | In January, he had an audience with his lord (Wang) Gong. |
| 11 | Owain is appointed lord. | 以邑名為氏。 | 1.19 | A record about the origin of last names | 以地方的名字作为姓氏 | Take the county name as the family name. |
| 12 | Nevertheless when it shall turn to the Lord, the vail shall be taken away. | 當體聖主好生之德,俟其向化。 | 1.19 | A war in the Qing dynasty: 清平王辅臣之战 | 应当体现君主的好生之德,等待他们归服。 | It is appropriate to embody the King's mercy toward the living, and wait for them to come over and pledge allegiance. |

Classical Chinese differs from Modern Chinese in vocabulary, grammar, and genre. Frequent ellipsis also makes Classical Chinese quite hard to understand, even for people like me who have studied it from primary school through college.

In my view, we should treat Classical Chinese as an independent language. Doing so would improve the quality of the extracted en-zh bitexts.

hoschwenk commented 5 years ago

Thanks a lot for this detailed analysis. We are aware that Wikipedia contains Chinese sentences in both traditional and simplified Chinese. The multilingual LASER embeddings should transparently support both, so we handled the Chinese Wikipedia as one single "language". What are the alignment scores in the above examples? As explained in the paper, not all the provided data should be considered perfectly aligned. Please consider the alignment score: in our experience, sentences with a score higher than 1.04 seem to be good mutual translations.
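For readers who want to apply the threshold mentioned above, here is a minimal sketch. It assumes each line of the TSV holds three tab-separated fields in the order score, source sentence, target sentence; that layout and the function name `filter_by_score` are assumptions for illustration, not something stated in this thread.

```python
def filter_by_score(tsv_lines, threshold=1.04):
    """Yield (score, source, target) for pairs scoring above `threshold`.

    Assumes each line is "score<TAB>source<TAB>target" (an assumption about
    the file layout). Lines that do not fit this shape are skipped.
    """
    for line in tsv_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not a score/source/target triple
        try:
            score = float(parts[0])
        except ValueError:
            continue  # first field is not a number
        if score > threshold:
            yield score, parts[1], parts[2]


# Usage with an in-memory sample; an opened file object works the same way.
sample = ["1.20\tHello.\t你好。\n", "1.03\tGoodbye.\t再见。\n"]
kept = list(filter_by_score(sample))
```

Note that the three examples in the table above score 1.19 to 1.20, so a score threshold alone would not remove them.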

xiaotaw commented 5 years ago

Scores updated. LASER does support Modern Chinese written in either Traditional or Simplified characters, and we can find many correctly aligned sentence pairs. However, almost all the Classical Chinese sentences (usually written in traditional form) do not match their English, as listed in the table. Classical Chinese is to Modern Chinese what Old English (450-1100 AD) is to Modern English: it differs substantially from the modern language.

hoschwenk commented 5 years ago

I was aware of the difference between "traditional" and "simplified" Chinese. Both are supported by LASER and, consequently, they should be well aligned in WikiMatrix. Do you mean that there is a third variant of Chinese, named "Classical" Chinese, which is significantly different from the former two? LASER was trained on bitexts from the UN corpus and OpenSubtitles and was probably never exposed to Classical Chinese. If this is confirmed, we should probably remove the badly aligned sentences in traditional Chinese from WikiMatrix. Is there a tool to detect the different variants of Chinese (we used fasttext LID)?

xiaotaw commented 5 years ago

Yes, Modern Chinese is quite different from "Classical" Chinese. The change results from the New Culture Movement of the 1910s and 1920s.

I have not found an out-of-the-box tool that recognizes Classical Chinese. However, the Wikipedia dump of the Classical Chinese corpus may help in two ways: ① simply check whether the sentence appears in the classical corpus; ② train a text classifier to distinguish Classical from Modern Chinese.
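The two ideas above can be sketched as follows. This is a toy illustration only: the training sentences are the six from the table in this issue, the names `in_classical_corpus` and `CharNgramNB` are hypothetical, and the classifier is a plain character n-gram naive Bayes chosen for brevity, not a tested recommendation. A real setup would train on the full Classical and Modern Chinese Wikipedia dumps.

```python
import math
from collections import Counter

# Toy data taken from the table in this issue (real use: Wikipedia dumps).
CLASSICAL = ["正月,朝見其主(王)龔。", "以邑名為氏。", "當體聖主好生之德,俟其向化。"]
MODERN = ["一月份,他拜见了他的君主(王)龚。", "以地方的名字作为姓氏",
          "应当体现君主的好生之德,等待他们归服。"]


def in_classical_corpus(sentence, corpus_sentences):
    """Approach 1: exact membership test against a set of known sentences."""
    return sentence.strip() in corpus_sentences


def char_ngrams(text, n=2):
    """Character unigrams plus n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return chars + ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


class CharNgramNB:
    """Approach 2: naive Bayes over character n-grams, add-one smoothing."""

    def __init__(self):
        self.counts = {}      # label -> Counter of n-grams
        self.docs = Counter() # label -> number of training sentences
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text)
            self.counts.setdefault(label, Counter()).update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            denom = sum(counter.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            score += sum(math.log((counter[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classical_set = set(CLASSICAL)
clf = CharNgramNB().fit([(s, "classical") for s in CLASSICAL] +
                        [(s, "modern") for s in MODERN])
```

Exact lookup only catches sentences copied verbatim from the classical corpus, while the classifier can generalize to unseen text but needs far more training data than shown here to be reliable.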

Emmanuel-R8 commented 4 years ago

To clarify, in case you are not familiar with Chinese (and similar languages). This is slightly off-topic: no programming involved.

Sino-Tibetan languages are based on single syllables (to simplify) and contain massive numbers of homophones (words that sound the same, since there are only so many ways to form a single syllable). Tones reduce this somewhat, but nowhere near enough. When speaking, the context indicates which homophone is meant. In addition, two syllables are often used to reduce ambiguity (in speech, two syllables/words are joined to form a single word).

When writing, alphabetical systems would be useless: it is much more efficient to use a single character for each syllable, so that each homophone gets its own distinct character. Visual specificity removes the phonetic ambiguity. Many languages in the region have historically adopted Chinese characters. For example, Vietnamese used them for a long time. Korean and Japanese have also long used them, but for unrelated reasons: they are not Sino-Tibetan languages and do not have the same homophone issue. Korean in recent times moved to its own phonetic system, while Japanese retains a limited set of Chinese characters (limited compared to Chinese) and uses phonetic systems for everything else.

Mandarin is the Sino-Tibetan language that was imposed by Chinese emperors as a lingua franca to rule their empire (think Latin in Europe, Russian in the Soviet Union), but locally many people still speak a separate language (e.g. Cantonese in Hong Kong/Macao/Guangzhou, Shanghainese, Hakka...). All use the same characters, but they are mutually unintelligible: different pronunciation, different grammars.

As a language, modern Mandarin has nothing to do with what Confucius would have spoken. However, the same characters have been used during the 2,500 years between him and us. This is the Classical Chinese referred to. To add to the confusion: (1) the meaning of characters has shifted somewhat over time, and (2) since characters are so explicit about their meaning, written Classical Chinese makes a point of removing any character that is remotely superfluous to understanding (that's the reference to ellipsis).

Cosmetically, the characters are the same, but they reflect completely different languages, in the same way that the same letters are used to write English, French or German.

Finally, Simplified vs Traditional: this has nothing to do with language. The Chinese communist government thought that traditional characters were too complicated, were culturally associated with the old times, and were a hindrance to fighting illiteracy. They decided to simplify the characters (in the 1950s and 1960s). This is purely cosmetic, and the simplified characters often simply reflect what people were already using when handwriting. The Mandarin spoken in Taiwan is written in traditional characters. The mainland uses simplified characters, but it is the same language.

Basically, Classical Chinese is not Mandarin.

xiaotaw commented 4 years ago

@Emmanuel-R8 Excellent and professional!

Classical Chinese vs Modern Chinese (including everyday Mandarin, Cantonese, etc.):

Similarity: Classical Chinese shares most of its characters with the modern languages.

Difference: ① the same character can have different meanings in Classical and Modern Chinese, just as 'Gift' means 'present' in English but 'poison' in German; ② the grammar is different.