AetherCortex / Llama-X

Open Academic Research on Improving LLaMA to SOTA LLM
Apache License 2.0
1.58k stars · 101 forks

Concern on the language #2

Open zhhongzhi opened 1 year ago

zhhongzhi commented 1 year ago

Interesting project, but I have some concerns about language coverage. It is known that there is relatively little Chinese text in LLaMA's training data, and each Chinese character is tokenized into several tokens, which is inefficient for generation. Will the project handle this, e.g. by adding new tokens and doing some additional pretraining?
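
To make the inefficiency concrete, here is a minimal sketch of how one might inspect the tokenization, assuming the Hugging Face `transformers` tokenizer and a LLaMA checkpoint (the model path below is a placeholder):

```python
# Sketch: inspect how the LLaMA SentencePiece tokenizer handles Chinese text.
# Assumes the Hugging Face `transformers` library; the model path is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")

text = "今天天气很好"  # "The weather is nice today"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text, add_special_tokens=False)

# Characters missing from the 32k vocabulary fall back to UTF-8 byte pieces,
# so a single Chinese character can cost two or three tokens.
print(tokens)
print(f"{len(text)} characters -> {len(ids)} tokens")
```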

yuys0602 commented 1 year ago

Yes, we need Chinese support and a larger Chinese corpus.

victorsungo commented 1 year ago

Thanks for reaching out. This is a good question. As shown in our proposed ten main research areas, multilingual capability is an important challenge for the first generation of LLaMA. According to our analysis, a potentially thorough solution is to add more high-quality Chinese corpus for additional pre-training, but we should always pay attention to the risk of forgetting the model's existing capabilities. This needs systematic research, and we hope more people will join the Llama-X community to discuss and solve this problem thoroughly together.

Let's keep this issue open and look forward to more solid methods.
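
One common way to reduce the forgetting risk mentioned above is to replay a fraction of the original (mostly English) data during the additional pre-training. Here is a minimal sketch using the `datasets` library, where the file names and the 80/20 ratio are purely illustrative:

```python
# Sketch: interleave new Chinese data with replayed English data for
# continued pre-training. File names and mixing ratio are illustrative only.
from datasets import load_dataset, interleave_datasets

chinese = load_dataset("json", data_files="chinese_corpus.jsonl", split="train")
english = load_dataset("json", data_files="english_replay.jsonl", split="train")

mixed = interleave_datasets(
    [chinese, english],
    probabilities=[0.8, 0.2],  # mostly new Chinese data, some English replay
    seed=42,
)
```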

zhangbo2008 commented 1 year ago

I debugged the Alpaca code; its vocabulary is only about 32k tokens, which is very small, and many Chinese characters are tokenized into two or more tokens, which is inefficient. If you want to use a new encoding for Chinese, you probably need to drop the existing Chinese tokens and add new ones.
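
As a rough sketch of what adding new tokens could look like with the Hugging Face `transformers` API (the model path and the tiny token list are placeholders; in practice one would add thousands of pieces trained on Chinese text):

```python
# Sketch: add new Chinese tokens and resize the embedding matrix.
# Assumes the Hugging Face `transformers` API; the model path and the
# token list are placeholders for illustration only.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "path/to/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

new_tokens = ["今天", "天气", "很好"]  # real vocabularies would add thousands of pieces
num_added = tokenizer.add_tokens(new_tokens)

# The appended embedding rows are randomly initialized, so additional
# pre-training is still required before the new tokens are useful.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size: {len(tokenizer)}")
```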