Closed hzhaoy closed 3 months ago
no. let me explain this.
during pretraining, unlike llama, we do not have the concept of bos and eos. the only thing we have is the eod, which refers to `<|endoftext|>`. this is the separator between documents. however, to adapt our model to llama-style usage, we use the same eod token for both bos and eos. you can simply regard the bos as something useless.
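to make the "separator for documents" idea concrete, here is a minimal sketch of how tokenized documents could be joined with the eod id. this is only an illustration, not the actual Qwen data pipeline: `pack_documents` is a hypothetical helper and the document token ids are made up. the only value taken from this thread is 151643 as the id of `<|endoftext|>`.

```python
# Illustrative only: a toy document packer, not Qwen's real pretraining code.
EOD_TOKEN_ID = 151643  # id of <|endoftext|>, as quoted in this thread


def pack_documents(docs, eod_id=EOD_TOKEN_ID):
    """Concatenate tokenized documents, appending the eod id after each one."""
    packed = []
    for doc in docs:
        packed.extend(doc)
        packed.append(eod_id)  # marks the end of this document
    return packed


# Two fake "documents" with made-up token ids.
stream = pack_documents([[11, 12, 13], [21, 22]])
print(stream)  # [11, 12, 13, 151643, 21, 22, 151643]
```

note that nothing in this stream plays the role of a bos: there is only one special token, sitting between documents.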
but for `<|im_start|>` and `<|im_end|>`, they are used for posttraining only, and they indicate the start and the end of each turn. this follows the practice of the chatml format by openai (learn about it by searching chatml online). for chat, we mostly use `<|im_end|>` as the stopping criterion.
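as a rough sketch of what this looks like in practice, the snippet below builds a chatml-style prompt by hand and cuts a generated string at `<|im_end|>`. the helper names `build_chatml_prompt` and `truncate_at_stop` are hypothetical and written only for this illustration. in real use the tokenizer's chat template does the formatting, and stopping is handled on token ids rather than strings.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages):
    """Wrap each turn as <|im_start|>role\\ncontent<|im_end|>\\n, then open an assistant turn."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    parts.append(f"{IM_START}assistant\n")  # the model continues from here
    return "".join(parts)


def truncate_at_stop(generated, stop=IM_END):
    """Keep only the text before the first stop marker (string-level stand-in for a stopping criterion)."""
    return generated.split(stop, 1)[0]


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
])
print(prompt)
print(truncate_at_stop("Hi there!<|im_end|>\n<|im_start|>user"))  # Hi there!
```

here `<|im_start|>` opens every turn, not just the sequence, which is why it is not treated as a bos token.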
In the model config file (for example, config.json in Qwen1.5-7B-Chat), bos_token_id is 151643:
"bos_token_id": 151643
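In the tokenizer config file (for example, tokenizer_config.json in Qwen1.5-7B-Chat), id 151643 corresponds to "<|endoftext|>", and "<|im_start|>" corresponds to id 151644. Also, can "<|im_start|>" here be understood as the bos_token? If so, why is bos_token set to null in the tokenizer config file? I don't quite understand this and would appreciate any clarification. Thanks!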