Yes, our main focus is on Chinese. You can refer to this issue. There are hundreds of Chinese characters in LLaMA's vocab.txt, so you can map each of those characters to a single token id. The remaining Chinese characters are handled by the tokenizer's byte fallback, where 3 token ids (one per UTF-8 byte) correspond to one character.
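A quick way to see this one-id-versus-three-ids behavior yourself is a minimal sketch like the one below, assuming you have LLaMA's SentencePiece model file locally (the `tokenizer.model` path and the sample characters are just placeholders):

```python
import sentencepiece as spm

# Load LLaMA's SentencePiece tokenizer model (path is a placeholder).
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

for ch in ["中", "龘"]:  # one common character, one rare one
    pieces = sp.encode_as_pieces(ch)
    ids = sp.encode_as_ids(ch)
    print(ch, pieces, ids)
    # An in-vocab character encodes to a single id (plus possibly a
    # leading word-boundary piece "▁"); an out-of-vocab character falls
    # back to its UTF-8 bytes, e.g. 龘 -> <0xE9> <0xBE> <0x98> (3 ids).
```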
I have a question: did you test the LLaMA tokenizer on Chinese?
I read the paper and they didn't support Chinese. This is the list of languages mentioned in the paper:

bg - Bulgarian
ca - Catalan
cs - Czech
da - Danish
de - German
en - English
es - Spanish
fr - French
hr - Croatian
hu - Hungarian
it - Italian
nl - Dutch
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sl - Slovenian
sr - Serbian
sv - Swedish
uk - Ukrainian