QwenLM / Qwen2

Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.

bos_token_id is inconsistent between the model config file and the tokenizer config file #369

Closed hzhaoy closed 3 months ago

hzhaoy commented 4 months ago

In the model config file (e.g. config.json in Qwen1.5-7B-Chat), bos_token_id is 151643: "bos_token_id": 151643

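In the tokenizer config file (e.g. tokenizer_config.json in Qwen1.5-7B-Chat), id 151643 corresponds to "<|endoftext|>", while "<|im_start|>" has id 151644. Also, can "<|im_start|>" here be understood as the bos_token? If so, why is bos_token set to null in the tokenizer config file? I don't quite understand this and would appreciate any clarification. Thanks!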

JustinLin610 commented 4 months ago

No. Let me explain.

During pretraining, unlike llama, we do not have the concept of bos and eos. All we have is the eod token, <|endoftext|>, which is the separator between documents. However, to make our model compatible with llama-style usage, we use that same eod token for both bos and eos. You can simply treat the bos as unused.
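To make the "document separator" idea concrete, here is a minimal sketch of packing tokenized documents with an eod token between them. It is an illustration only, not the Qwen team's actual data pipeline: the function name `pack_documents` and the small document token IDs are hypothetical, and only the ID 151643 for <|endoftext|> comes from the question above.

```python
# ID of <|endoftext|> as reported in the question above.
EOD_ID = 151643


def pack_documents(docs):
    """Concatenate tokenized documents, appending the eod separator after each.

    `docs` is a list of token-ID lists. The IDs used below are made up
    for illustration and do not correspond to real Qwen tokens.
    """
    packed = []
    for doc in docs:
        packed.extend(doc)
        packed.append(EOD_ID)  # marks the end of this document
    return packed


# Two hypothetical documents packed into one training sequence.
packed = pack_documents([[1, 2, 3], [4, 5]])
print(packed)  # [1, 2, 3, 151643, 4, 5, 151643]
```

Note that no separate start-of-sequence token appears anywhere in the packed stream, which is consistent with the bos entry having no role of its own.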

<|im_start|> and <|im_end|>, on the other hand, are used only in post-training, where they mark the start and the end of each turn. This follows the ChatML format from OpenAI (you can learn about it by searching for ChatML online). For chat, we mostly use <|im_end|> as the stopping criterion.
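For readers unfamiliar with ChatML, the turn markers can be sketched in plain Python as follows. This is a simplified illustration, not the tokenizer's real chat template: the helper names `to_chatml` and `truncate_at_stop` are hypothetical, and the exact whitespace of the official template may differ.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|> ... <|im_end|>.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def truncate_at_stop(text, stop=IM_END):
    """Cut generated text at the first stop marker, if one is present."""
    index = text.find(stop)
    return text if index == -1 else text[:index]


prompt = to_chatml([{"role": "user", "content": "hi"}])
print(prompt)
print(truncate_at_stop("Hello!<|im_end|>\nleftover text"))  # Hello!
```

Stopping on <|im_end|> ends generation at the close of the assistant's turn, whereas <|endoftext|> only marks a document boundary.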