CVI-SZU / Linly

Chinese-LLaMA 1&2 and Chinese-Falcon base models; ChatFlow Chinese dialogue model; Chinese OpenLLaMA model; NLP pre-training / instruction fine-tuning datasets

Is the tokenizer.model the same as the one in llama-7b? #118

Open treya-lin opened 1 year ago

treya-lin commented 1 year ago

Hi, the README says:

A combined character-and-word tokenizer optimized for Chinese

So I had assumed that every tokenizer.model released with Linly's models differs from the original llama-7b one. Is that correct?

But recently I checked the repo mentioned in this documentation: https://github.com/CVI-SZU/Linly/wiki/%E5%A2%9E%E9%87%8F%E8%AE%AD%E7%BB%83

I ran diff on the tokenizer.model files downloaded from these two repositories, and also on the one from Chatflow-7b's repo, but they all appear to be identical (diff reported no differences).

I am a bit confused: are they actually the same or different? If they are different, how can I test that?
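
For reference, here is a minimal sketch of one way to check this beyond diff. It compares the files by SHA-256 hash, then compares how two tokenizers split some sample text. The helper names (`file_sha256`, `same_tokenizer_file`, `differing_tokenizations`) are hypothetical and not part of Linly or LLaMA. The `encode_a` / `encode_b` arguments are assumed to be any callables that map a string to a list of tokens.

```python
import hashlib


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_tokenizer_file(path_a, path_b):
    """True if the two files have the same SHA-256 digest (byte-identical in practice)."""
    return file_sha256(path_a) == file_sha256(path_b)


def differing_tokenizations(encode_a, encode_b, samples):
    """Return the samples that the two encode callables tokenize differently.

    encode_a / encode_b are assumed to take a string and return a list of tokens.
    """
    return [s for s in samples if encode_a(s) != encode_b(s)]
```

If the `sentencepiece` package is installed, the encode callables could be built from each file with `sentencepiece.SentencePieceProcessor(model_file=path)` and its `encode(text, out_type=str)` method, and the vocabulary sizes compared with `get_piece_size()`. If the hashes match, the files are identical and the tokenization comparison will return an empty list for any samples.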