epfLLM / Megatron-LLM

distributed trainer for LLMs

Prepend bos token #54

Closed panx27 closed 12 months ago

panx27 commented 1 year ago

In the original Llama repository, a BOS token is prepended during inference, as seen in this code snippet.

Given this, should we also prepend a BOS token to each document during the 2nd stage of pretraining, to stay aligned with the original model's practice?

In prior models such as GPT-2 and BLOOM, an <|endoftext|> token is typically used to delineate separate documents, e.g. doc1 <eos> doc2 <eos> .... I'm not sure how Llama-2 handles this exactly, but perhaps something like <bos> doc1 <eos> <bos> doc2 <eos> ...?
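
For concreteness, here is a minimal sketch of the two packing layouts I have in mind. It assumes a hypothetical tokenizer object exposing `bos_id`, `eos_id`, and `encode()`; this is illustrative only, not the repository's actual preprocessing code:

```python
def pack_eos_only(docs, tokenizer):
    """GPT-2 / BLOOM style packing: doc1 <eos> doc2 <eos> ..."""
    ids = []
    for doc in docs:
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_id)
    return ids


def pack_bos_eos(docs, tokenizer):
    """Proposed layout: <bos> doc1 <eos> <bos> doc2 <eos> ..."""
    ids = []
    for doc in docs:
        ids.append(tokenizer.bos_id)  # prepend BOS to each document
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_id)  # EOS still delimits documents
    return ids
```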

panx27 commented 12 months ago

Based on this recent work, always adding sink tokens like BOS at the beginning might be helpful. I will close this issue.