LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Subdivide Chinese dataset by character sets #1145

Closed. peakji closed this issue 1 year ago.

peakji commented 1 year ago

For better data consistency, I suggest subdividing Chinese into two datasets:

- Simplified Chinese (zh-Hans)
- Traditional Chinese (zh-Hant)

Although their grammars are similar, there are a large number of characters that do not overlap. For example, 国 ("nation" in SC) and 國 ("nation" in TC) have the same meaning but different Unicode codepoints. When training the model, the different tokens are mapped to different embeddings, and there is no standard "normalization mapping" between Simplified Chinese and Traditional Chinese because of regional wording conventions, e.g. 鼠标 vs 滑鼠. The relationship between Simplified and Traditional Chinese is therefore better treated as translation rather than simple case folding.
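To make the codepoint point concrete, here is a minimal check using only the Python standard library; note that Unicode normalization, unlike case folding for Latin scripts, does not map Traditional characters to Simplified ones:

```python
import unicodedata

simplified, traditional = "国", "國"  # "nation" in Simplified and Traditional Chinese

# The two characters have distinct Unicode codepoints, so a tokenizer will assign
# them different token IDs and therefore different embeddings.
print(hex(ord(simplified)), hex(ord(traditional)))  # 0x56fd 0x570b

# Unicode normalization forms do not unify them the way case folding unifies A/a:
for form in ("NFC", "NFKC"):
    print(form, unicodedata.normalize(form, traditional) == simplified)  # False for both
```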

As mentioned in IETF BCP 47, some other languages, such as Serbian, have similar issues. I'm not sure whether the difference there is as significant as between the Chinese writing systems, but I'd guess the impact on the trained model is minor, since their alphabets are far smaller than the Chinese character inventory. In fact, Simplified and Traditional Chinese characters make up a large part of the vocabulary files of mBERT and XLM.
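As a rough way to check the vocabulary claim for mBERT, one could count how many of its vocabulary entries contain CJK ideographs; this is a sketch assuming the Hugging Face transformers package is available, and it does not assert a particular count:

```python
from transformers import AutoTokenizer  # assumption: transformers is installed

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
vocab = tokenizer.get_vocab()  # maps token string -> id

def contains_cjk(token: str) -> bool:
    # CJK Unified Ideographs block, which covers both Simplified and Traditional characters
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in token)

cjk_entries = sum(1 for token in vocab if contains_cjk(token))
print(f"{cjk_entries} of {len(vocab)} mBERT vocabulary entries contain CJK ideographs")
```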

I have many years of experience in multilingual language modeling and information retrieval, and I would really like to contribute to this project. In addition to technical contributions, I'm also willing to contribute Chinese translations and spread the word in the local NLP community.

bitplane commented 1 year ago

We chatted about this on Discord: the data/social issue with doing this now is that the data will be split, which will slow the take-off of collection in Chinese.

Are there technical/data quality issues around this too? For example, does one script map to fewer GPT tokens? Would transliterating simplified to traditional (or vice versa) exclude people, or harm data collection more than having more users would benefit?
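On the GPT-token question, a quick empirical check is possible; this is a sketch assuming the tiktoken package and its cl100k_base encoding, and the example sentences are only illustrative:

```python
import tiktoken  # assumption: tiktoken is installed

enc = tiktoken.get_encoding("cl100k_base")

# Simplified / Traditional pairs with the same meaning ("mouse", the pointing device)
pairs = [
    ("鼠标", "滑鼠"),
    ("我想买一个新的鼠标。", "我想買一個新的滑鼠。"),
]
for sc, tc in pairs:
    print(f"SC {sc!r}: {len(enc.encode(sc))} tokens | TC {tc!r}: {len(enc.encode(tc))} tokens")
```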

Is it something we can do later on, or are we causing data pollution by not doing it now?

peakji commented 1 year ago

It depends on our requirements for data quality.

The relationship between Simplified and Traditional Chinese is closer to translation than transliteration, so a one-to-one character mapping cannot simply be relied upon. This involves not only the differences in terms I mentioned above (e.g. 鼠标 vs 滑鼠), but also the overall language habits of different regions.

I think it is totally fine to handle this in post-processing, as long as we have a reliable machine translation solution.
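For reference, a rule-based converter such as OpenCC could serve as a baseline for that post-processing step, though it relies on character and phrase tables rather than the machine translation described above; a sketch assuming the opencc Python bindings:

```python
from opencc import OpenCC  # assumption: the opencc Python bindings are installed

# "s2twp": Simplified -> Traditional (Taiwan standard), including phrase-level vocabulary rules.
# Depending on the binding, the config may need to be written as "s2twp.json".
cc = OpenCC("s2twp")

def to_traditional(text: str) -> str:
    return cc.convert(text)

print(to_traditional("我想买一个新的鼠标。"))
# Phrase tables cover many regional terms, but coverage should be spot-checked,
# since this is still rule-based conversion rather than translation.
```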

lone-wolf-akela commented 1 year ago

I've noticed conversations like the one below appearing in the collected data: [screenshot of the conversation]. Here, the user asks a question in Traditional Chinese and the assistant answers in Simplified Chinese; the user then tries to correct the assistant and asks it to re-answer in Traditional Chinese. I wonder whether this kind of training data will teach the model when to answer in which kind of Chinese. Also, if we use some machine translation solution to preprocess the training data into Traditional or Simplified Chinese, this whole conversation will no longer make sense after the translation.
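One way to avoid mangling such conversations would be to detect each message's script before any conversion and flag conversations where the scripts disagree; a rough heuristic sketch, again assuming the opencc Python bindings:

```python
from opencc import OpenCC  # assumption: the opencc Python bindings are installed

s2t, t2s = OpenCC("s2t"), OpenCC("t2s")

def script_of(text: str) -> str:
    """Rough heuristic for which Chinese script a message uses."""
    has_simplified = s2t.convert(text) != text   # changed by s->t: contains SC-only characters
    has_traditional = t2s.convert(text) != text  # changed by t->s: contains TC-only characters
    if has_simplified and has_traditional:
        return "mixed"
    if has_traditional:
        return "traditional"
    if has_simplified:
        return "simplified"
    return "shared"  # only characters common to both scripts (or no Chinese at all)

# A conversation whose messages disagree (e.g. a Traditional prompt answered in Simplified)
# could then be kept as-is or reviewed manually instead of being blindly converted.
```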