bos_token_id is inconsistent between the model config file and the tokenizer config file

QwenLM / Qwen2

Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.


No. Let me explain.

During pretraining, unlike Llama, we do not have the concept of bos and eos. The only thing we have is the eod, which refers to <|endoftext|>. This is the separator between documents. However, to adapt our model to Llama-style usage, we use the same eod token for both bos and eos. You can simply regard the bos as something unused that you can ignore.
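To make the "eod only" idea concrete, here is a minimal sketch of packing documents into one training stream. It assumes each document is simply followed by the separator, and the helper names `pack_documents` and `split_documents` are hypothetical, not part of any Qwen tooling. Note that nothing is prepended as a bos.

```python
# Illustrative only: the exact packing used in Qwen pretraining is not
# described in this thread; this assumes "document + separator" per document.
EOD = "<|endoftext|>"


def pack_documents(docs):
    """Concatenate documents, following each one with the EOD separator."""
    return "".join(doc + EOD for doc in docs)


def split_documents(stream):
    """Recover the original documents from a packed stream."""
    return [part for part in stream.split(EOD) if part]


packed = pack_documents(["first doc", "second doc"])
print(packed)
print(split_documents(packed))
```

Under this layout there is no separate start-of-sequence marker at all, which is why the bos entry in the config can point at the same eod token without affecting anything.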

As for <|im_start|> and <|im_end|>, they are used in post-training only, and they mark the start and the end of each turn. This follows the ChatML format from OpenAI (search for "chatml" online to learn about it). For chat, we mostly use <|im_end|> as the stopping criterion.
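The turn markers and the stopping rule can be sketched in plain Python. This is a rough illustration assuming the usual ChatML layout (role on the first line, content after a newline); `format_chatml` and `truncate_at_stop` are hypothetical helpers, and in practice the tokenizer's chat template does this for you.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML prompt."""
    prompt = ""
    for message in messages:
        prompt += f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
    # Open the assistant turn so the model continues from here.
    prompt += f"{IM_START}assistant\n"
    return prompt


def truncate_at_stop(generated, stop=IM_END):
    """Cut generated text at the first stop marker, if one is present."""
    index = generated.find(stop)
    return generated if index == -1 else generated[:index]


prompt = format_chatml([{"role": "user", "content": "hi"}])
print(prompt)
print(truncate_at_stop("hello there" + IM_END + "\nleftover"))
```

Notice that no bos appears anywhere in the prompt: each turn is delimited by <|im_start|> and <|im_end|>, and generation ends at <|im_end|>.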


bos_token_id is inconsistent between the model config file and the tokenizer config file #369