Does the LLaMA tokenizer support Chinese?

Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model (a low-resource Chinese LLaMA + LoRA approach, structured after Alpaca)

https://github.com/Facico/Chinese-Vicuna

Apache License 2.0

4.14k stars, 421 forks

Does the LLaMA tokenizer support Chinese? #78

Closed: abdoelsayed2016 closed this issue 1 year ago

abdoelsayed2016 commented 1 year ago

I have a question: did you test the LLaMA tokenizer on Chinese?

I read the paper, and it does not list Chinese among the supported languages. These are the languages mentioned in the paper: bg - Bulgarian, ca - Catalan, cs - Czech, da - Danish, de - German, en - English, es - Spanish, fr - French, hr - Croatian, hu - Hungarian, it - Italian, nl - Dutch, pl - Polish, pt - Portuguese, ro - Romanian, ru - Russian, sl - Slovenian, sr - Serbian, sv - Swedish, uk - Ukrainian.

Facico commented 1 year ago

Yes, our main focus is on Chinese. You can refer to this issue. There are hundreds of Chinese characters in LLaMA's vocab.txt, and each of those characters maps one-to-one to a token id. Any other Chinese character can be represented by 3 token ids (the tokenizer falls back to byte-level tokens, and most Chinese characters take 3 bytes in UTF-8).
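
To illustrate the "one token id or three token ids" behaviour described above, here is a minimal sketch that simulates a character lookup with UTF-8 byte fallback. It does not load the real LLaMA tokenizer: the function name `byte_fallback_pieces` and the tiny `known_chars` set are hypothetical stand-ins for the actual vocabulary, and the real tokenizer also merges subword pieces, which this sketch ignores.

```python
def byte_fallback_pieces(text, known_chars):
    """Split text into pieces: a character that is in the (hypothetical)
    vocabulary becomes one piece; any other character is replaced by one
    byte-level piece per UTF-8 byte, written in the <0xNN> style."""
    pieces = []
    for ch in text:
        if ch in known_chars:
            # Character has its own vocabulary entry: one token.
            pieces.append(ch)
        else:
            # Fall back to the character's UTF-8 bytes: one token per byte.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


# Assumption: pretend only these two characters are in the vocabulary.
known_chars = {"中", "文"}

pieces = byte_fallback_pieces("中文龘", known_chars)
print(pieces)       # two single-character pieces, then three byte pieces
print(len(pieces))  # total number of tokens for the three characters
```

In this sketch the two characters in the assumed vocabulary cost one token each, while the rare character costs three, so Chinese text that relies heavily on the fallback path uses noticeably more tokens per character.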