zhhongzhi opened this issue 1 year ago
Yes, we need Chinese support; a larger Chinese corpus would help.
Let's keep this issue open and look forward to more solid methods.
I debugged the Alpaca code: its vocabulary is only ~30k tokens, which is very small, and some Chinese characters are tokenized into 2+ tokens, which is inefficient. If you want to use a new encoding for Chinese, you may need to delete the existing Chinese tokens and add new ones.
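For reference, here is a minimal sketch of how to check this with the HuggingFace tokenizer (the checkpoint path below is just a placeholder, and the exact vocab size may differ between LLaMA and Alpaca releases):

```python
# Sketch: inspect the LLaMA/Alpaca sentencepiece vocab and see how Chinese text splits.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-or-alpaca")  # placeholder path

print(tokenizer.vocab_size)  # ~32k pieces in the original LLaMA tokenizer

# Characters missing from the vocab fall back to byte-level pieces,
# so a single Chinese character can cost 2-3 tokens.
text = "你好"
ids = tokenizer.encode(text, add_special_tokens=False)
print(ids, tokenizer.convert_ids_to_tokens(ids))
```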
Interesting project, but I have some concerns about the language. As is well known, there are relatively few Chinese tokens in LLaMA's training data, and each Chinese character is tokenized into several tokens, which is inefficient for generation. Would the project handle this, e.g. by adding new tokens and doing some pretraining? A possible sketch of that approach follows below.
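One possible way to do this with standard HuggingFace APIs (a hedged sketch under my own assumptions, not the project's actual plan): add new Chinese tokens to the tokenizer, resize the embedding matrix, then continue pretraining on a Chinese corpus so the new embeddings become useful.

```python
# Sketch: extend the vocab with new Chinese tokens and resize the embeddings.
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")  # placeholder path
model = LlamaForCausalLM.from_pretrained("path/to/llama")    # placeholder path

new_tokens = ["你好", "世界"]  # in practice, thousands of frequent Chinese tokens/characters
num_added = tokenizer.add_tokens(new_tokens)

# The new token ids get randomly initialized embedding rows;
# they still need continued pretraining on Chinese text to be useful.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size = {len(tokenizer)}")
```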