TigerResearch / TigerBot

TigerBot: A multi-language multi-task LLM
https://www.tigerbot.com
Apache License 2.0

13b-chat: vocab size mismatch when converting the model with llama.cpp #110

Open iDonal opened 1 year ago

iDonal commented 1 year ago

Exception: Vocab size mismatch (model has 60928, but /home/rsync_user/tigerbot-13b-chat/tokenizer.model combined with /home/rsync_user/tigerbot-13b-chat/added_tokens.json has 60515).

The provided tokenizer has 60515 tokens in total, but the model's vocab (embedding) width is 60928.
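For context, the two numbers can be reproduced with transformers; the path below is the one from the error message, and this is just an illustrative check, not part of the official conversion steps:

import transformers

path = '/home/rsync_user/tigerbot-13b-chat'

# Tokenizer side: sentencepiece vocab plus added_tokens.json -> 60515 tokens.
tokenizer = transformers.AutoTokenizer.from_pretrained(path)
print(len(tokenizer))              # 60515

# Model side: the embedding/lm_head width recorded in config.json -> 60928.
config = transformers.AutoConfig.from_pretrained(path)
print(config.vocab_size)           # 60928, i.e. 413 extra rows of padding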

iDonal commented 1 year ago

Conversion script: https://github.com/ggerganov/llama.cpp/blob/master/convert.py

i4never commented 1 year ago

Hi, this happens because the vocabulary was sharded during pretraining; to keep the shards evenly sized, the embedding and lm head were padded. You can try converting a model saved with the following code:

import torch
import transformers

# Load the checkpoint, drop the padded rows, and save a copy whose vocab size is 60515.
model = transformers.AutoModelForCausalLM.from_pretrained('/home/rsync_user/tigerbot-13b-chat', torch_dtype=torch.bfloat16)
model.resize_token_embeddings(60515)
model.save_pretrained('./tigerbot-13b-chat-vocab-60515')

resize_token_embeddings drops the trailing padded rows of the embedding and lm_head. https://github.com/huggingface/transformers/blob/869733ab621495b938d0754176f7f1e360ae7ea9/src/transformers/modeling_utils.py#L1581
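In other words, the call above effectively truncates both weight matrices to the first 60515 rows. A rough, self-contained sketch of that effect (not the actual transformers implementation; the hidden size is only a placeholder, the real value comes from the model config):

import torch

# Illustrative only: truncating a padded weight matrix the way
# resize_token_embeddings(60515) does for this checkpoint.
hidden_size = 5120                      # placeholder hidden size for illustration
padded = torch.empty(60928, hidden_size)   # embedding/lm_head as saved, with padding
trimmed = padded[:60515].clone()           # last 413 padded rows dropped
print(trimmed.shape)                       # torch.Size([60515, 5120])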

iDonal commented 1 year ago

It converts with llama.cpp now, thanks!